Assistant Commissioner for Patents and Trademarks
Washington, D.C. 20231 
BOX: PCT 

Dear Sir: 

Prior to calculating the National Stage filing fee, cancel claims 1-68 and insert 
the following new claims: 

1. (new) A method for the co-articulation-specific concatenation of audio segments, in
order to generate synthesised acoustical data which reproduces a sequence of
concatenated sounds/phones, comprising the following steps:

- selecting at least two audio segments which contain bands, each of which reproduces
a portion of a sound/phone or a portion of a sound/phone sequence,

- establishing a band to be used of an earlier audio segment; 

- establishing a band to be used of a later audio segment, which begins with the later 
audio segment and ends with the co-articulation band of the later audio segment which 
follows the initially used solo articulation band;

- with the duration and position of the bands to be used being determined as a function
of the earlier and later audio segments; and 

- concatenating the established band of the earlier audio segment with the established 
band of the later audio segment, in that the instance of concatenation, as a function of 
properties of the used band of the later audio segment, is set in a band which begins 
immediately before the used band of the later audio segment and ends with same. 

2. (new) The method according to Claim 1, characterised in that 

- the instance of concatenation is set in a band which lies in the vicinity of the boundaries 
of the initially to be used solo articulation band of the later audio segment, if the band of 
same to be used reproduces a static sound/phone at the beginning; and 

- a downstream portion of the band to be used of the earlier audio segment and an 
upstream portion of the band to be used of the later audio segment are processed by 
means of suitable transfer functions and added in an overlapping manner (cross fade), 
with the transfer functions and the length of an overlapping portion of the two bands 
being determined depending on the audio segments to be concatenated.

3. (new) The method according to Claim 1, characterised in that 

- the instance of concatenation is set in a band which lies immediately before the band 
to be used of the later audio segment, if the used band of same reproduces a dynamic 
sound/phone at the beginning; and

- a downstream portion of the band to be used of the earlier audio segment and an 
upstream portion of the band to be used of the later audio segment are processed by 
means of suitable transfer functions and joined in a non-overlapping manner (hard fade), 
with the transfer functions being determined depending on the acoustical data to be 
synthesised. 

4. (new) The method according to Claim 1 characterised in that for a sound/phone or a 
portion of the sequence of concatenated sounds/phones at the start of the concatenated 
sound/phone sequence a band of an audio segment is selected so that the start of the band 
reproduces the properties of the start of the concatenated sound/phone sequence. 

5. (new) The method according to Claim 1 characterised in that for a sound/phone or a
portion of the sequence of concatenated sounds/phones at the end of the concatenated 
sound/phone sequence a band of an audio segment is selected so that the end of the band 
reproduces the properties of the end of the concatenated sound/phone sequence. 

6. (new) The method according to Claim 1 characterised in that the voice data to be
synthesised is combined in groups, each of which is described by an individual audio
segment. 

7. (new) The method according to Claim 1 characterised in that an audio segment is 
selected for the later audio segment band, which reproduces the highest number of 
successive portions of the sounds/phones of the sound/phone sequence, in order to use 
the smallest number of audio segment bands in the generation of the synthesised 
acoustical data. 

8. (new) The method according to Claim 1 characterised in that a processing of the used 
bands of individual audio segments is carried out by means of suitable functions 
depending on properties of the concatenated sound/phone sequence, with these properties 
involving i.a. a modification of the frequency, the duration, the amplitude, or the
spectrum.

9. (new) The method according to Claim 1 characterised in that a processing of the used 
bands of individual audio segments is carried out by means of suitable functions in a 
band, in which the instance of concatenation lies, with these functions involving i.a. a 
modification of the frequency, the duration, the amplitude, or the spectrum. 

10. (new) The method according to Claim 1 characterised in that the instance of 
concatenation is set in places of the bands to be used of the earlier and/or later audio
segment, in which the two used bands are in agreement with respect to one or several
suitable properties, with these properties including i.a.: zero point, amplitude values, 
gradients, derivatives of any degree, spectra, tone levels, amplitude values within a 
frequency band, volume, style of speech, emotion of speech, or other properties covered
in the phone classification scheme. 

11. (new) The method according to Claim 1 characterised in that

- the selection of the used bands of individual audio segments, their processing, their 
variation, as well as their concatenation are additionally carried out with the application 
of heuristic knowledge which is obtained by an additionally carried out heuristic method. 

12. (new) The method according to Claim 1 characterised in that 

- the acoustical data to be synthesised is voice data, and the sounds are phones. 

13. (new) The method according to Claim 2 characterised in that 

- the static phones include vowels, diphthongs, liquids, vibrants, fricatives and nasals.

14. (new) The method according to Claim 3 characterised in that

- the dynamic phones include plosives, affricates, glottal stops, and click sounds. 

15. (new) The method according to Claim 1 characterised in that 

- a conversion of the synthesised acoustical data to acoustical signals and/or voice signals 
is carried out. 

16. (new) A device for the co-articulation-specific concatenation of audio segments, in 
order to generate synthesised acoustical data which reproduces a sequence of phones, 
comprising:

- a database (107) in which audio segments are stored, each of which reproduces a portion
of a phone or portions of a sequence of (concatenated) phones; 

- and/or any upstream synthesis means (108) which supplies audio segments; 

- a means (105) for the selection of at least two audio segments from the database (107)
and/or the upstream synthesis means (108); and

- a means (111) for the concatenation of audio segments, characterised in that the
concatenation means (111) is suited for

- defining a band to be used of an earlier audio segment; 

- defining a portion to be used of a later audio segment in a band which starts with the 
later audio segment and ends after a co-articulation band of the later audio segment, 
which follows after the initially used solo articulation band; 

- determining the duration and position of the used bands depending on the earlier and 
later audio segments; and 

- concatenating the used band of the earlier audio segment with the used band of the later 
audio segment by defining the instance of concatenation as a function of properties of the 
used band of the later audio segment in a band which starts immediately before the used 
band of the later audio segment and ends with same. 

17. (new) The device according to Claim 16, characterised in that the concatenation 
means (111) comprises: 

- means for the concatenation of the used band of the earlier audio segment with the used 
band of the later audio segment, whose used band reproduces a static phone at the 
beginning in the vicinity of the boundaries of the initially occurring solo articulation band 
of the used band of the later audio segment; 

- means for processing a downstream portion of the used band of the earlier audio 
segment and an upstream portion of the used band of the later audio segment by suitable 
transfer functions; and 

- means for the overlapping addition of the two bands in an overlapping portion (cross 
fade), which depends on the audio segments to be concatenated, with the transfer 
functions and the length of an overlapping portion of the two bands being determined 
depending on the acoustical data to be synthesised. 

18. (new) The device according to Claim 16 characterised in that the concatenation
means (111) comprises:

- means for the concatenation of the used band of the earlier audio segment with the used
band of the later audio segment, whose used band reproduces a dynamic phone at the
beginning, immediately before the used band of the later audio segment;

- means for processing a downstream portion of the used band of the earlier audio 
segment and an upstream portion of the used band of the later audio segment by suitable 
transfer functions, with the transfer functions being determined depending on the 
acoustical data to be synthesised; and 

- means for the non-overlapping joining of the two audio segments. 

19. (new) The device according to Claim 16 characterised in that the database (107) 
includes audio segments or the upstream synthesis means (108) supplies audio segments 
which comprise bands which at the start reproduce a phone or a portion of the 
concatenated phone sequence at the start of the concatenated phone sequence. 

20. (new) The device according to Claim 16 characterised in that the database (107) 
includes audio segments or the upstream synthesis means (108) supplies audio segments 
which comprise bands, whose ends reproduce a phone or a portion of the concatenated 
phone sequence at the end of the concatenated phone sequence. 

21. (new) The device according to Claim 16 characterised in that the database (107) 
includes a group of audio segments or the upstream synthesis means (108) supplies audio 
segments which comprise bands, whose starts each reproduce only a static phone. 

22. (new) The device according to Claim 16 characterised in that the concatenation 
means (111) comprises: 

- means for the generation of further audio segments by concatenation of audio segments, 
with the starts of the bands each reproducing a static phone, each with a band of a later 
audio segment whose used band reproduces a dynamic phone at the start, and 

- a means which supplies the further audio segments to the database (107) or the selection
means (105). 

23. (new) The device according to Claim 16 characterised in that, in the selection of the
audio segment bands from the database (107) or the upstream synthesis means (108), the
selection means (105) is suited to select the audio segments which reproduce the greatest
number of successive portions of concatenated phones of the concatenated phone
sequence.

24. (new) The device according to Claim 16 characterised in that the concatenation 
means (111) comprises means for processing the used bands of individual audio segments 
with the aid of suitable functions, depending on properties of the concatenated phone 
sequence, with the functions involving among others a modification of the frequency, the 
duration, the amplitude, or the spectrum. 

25. (new) The device according to Claim 16 characterised in that 

- the concatenation means (111) comprises means for processing the used bands of 
individual audio segments with the aid of suitable functions in a band including the 
instance of concatenation, with this function involving i.a. a modification of the 
frequency, the duration, the amplitude, or the spectrum. 

26. (new) The device according to Claim 16 characterised in that 

- the concatenation means (111) comprises means for the selection of the instance of 
concatenation in a place in the used bands of the earlier and/or the later audio segment, 
in which the two used bands are in agreement with respect to one or several suitable 
properties, with these properties including i.a.: zero points, amplitude values, gradients, 
derivatives of any degree, spectra, tone levels, amplitude values in a frequency band,
volume, style of speech, emotion of speech, or other properties covered in the phone 
classification scheme. 

27. (new) The device according to Claim 16 characterised in that 

- the selection means (105) comprises means for the implementation of heuristic 
knowledge which relates to the selection of the used bands of the individual audio 
segments, their processing, their variation, as well as their concatenation. 

28. (new) The device according to Claim 16 characterised in that

- the database (107) includes audio segments or the upstream synthesis means (108)
supplies audio segments which include bands, each of which reproduces at least a
portion of a sound or phone, respectively, a sound or phone, respectively, portions of 
phone sequences or polyphones, respectively, or sound sequences or polyphones, 
respectively. 

29. (new) The device according to Claim 17 characterised in that 

- the database (107) includes audio segments or the upstream synthesis means (108)
supplies audio segments, with a static sound corresponding to a static phone and
comprising vowels, diphthongs, liquids, vibrants, fricatives, and nasals.

30. (new) The device according to Claim 18 characterised in that 

- the database (107) includes audio segments or the upstream synthesis means (108) 
supplies audio segments, with a dynamic sound corresponding to a dynamic phone and 
comprising plosives, affricates, glottal stops, and click sounds.

31. (new) The device according to Claim 16 characterised in that

- the concatenation means (111) is suitable to generate synthesised voice data by means
of the concatenation of audio segments. 

32. (new) The device according to Claim 16 characterised in that 

- means (117) are provided for the conversion of the synthesised acoustical data to 
acoustical signals and/or voice signals. 

33. (new) A data carrier which includes a computer program for the
co-articulation-specific concatenation of audio segments in order to generate synthesised acoustical data
which reproduces a sequence of concatenated phones, comprising the following steps: 

- selection of at least two audio segments which contain bands, each of which
reproduces a portion of a sound/phone or a portion of a sound/phone sequence,

characterised by the steps of:

- establishing a band to be used of an earlier audio segment; 

- establishing a band to be used of a later audio segment, which begins with the later 
audio segment and ends with the co-articulation band of the later audio segment which 
follows the initially used solo articulation band; 

- with the duration and position of the bands to be used being determined as a function 
of the earlier and later audio segments; and 

- concatenating the established band of the earlier audio segment with the established 
band of the later audio segment, in that the instance of concatenation, as a function of 
properties of the used band of the later audio segment, is set in its established band which 
starts immediately before the band to be used of the later audio segment and ends with 
same. 

34. (new) The data carrier according to Claim 33, characterised in that the computer 
program selects the instance of the concatenation of the used band of the second audio 
segment with the used band of the first audio segment in such a manner that 

- the instance of concatenation is set in a band which lies in the vicinity of the boundaries 
of the initially used solo articulation band of the later audio segment, if its used band 
reproduces a static phone at the start; 

- a downstream portion of the used band of the earlier audio segment and an upstream 
portion of the used band of the later audio segment are processed by suitable transfer 
functions and added in an overlapping manner (cross fade), with the transfer functions 
and the length of an overlapping portion of the two bands being determined depending 
on the audio segments to be concatenated. 

35. (new) The data carrier according to Claim 33 characterised in that the computer
program selects the instance of the concatenation of the used band of the second audio 
segment with the used band of the first audio segment in such a manner that 

- the instance of concatenation is set in a band which lies immediately before the used 
band of the later audio segment, if its used band reproduces a dynamic phone at the start; 

- a downstream portion of the used band of the earlier audio segment and an upstream
portion of the used band of the later audio segment are processed by suitable transfer 
functions and added in a non-overlapping manner (hard fade), with the transfer functions 
being determined depending on the audio segments to be concatenated. 

36. (new) The data carrier according to Claim 33 characterised in that the computer 
program selects a band of an audio segment for a phone or a portion of the sequence of 
concatenated phones at the start of the concatenated phone sequence, the start of which 
reproduces the properties of the start of the concatenated sequence of phones. 

37. (new) The data carrier according to Claim 33 characterised in that the computer 
program selects a band of an audio segment for a phone or a portion of the sequence of 
concatenated phones at the end of the concatenated phone sequence, the end of which 
reproduces the properties of the end of the concatenated sequence of phones. 

38. (new) The data carrier according to Claim 33 characterised in that the computer 
program carries out a processing of the used bands of individual audio segments with the 
aid of suitable functions depending on properties of the phone sequence, with the 
functions involving i.a. modification of the frequency, the duration, the amplitude, or the 
spectrum. 

39. (new) The data carrier according to Claim 33 characterised in that the computer
program selects an audio segment band for the later audio segment band which
reproduces the highest number of successive portions of the concatenated phones in the
phone sequence, in order to use the smallest number of audio segment bands in the 
generation of the synthesised acoustical data. 

40. (new) The data carrier according to Claim 39 characterised in that the computer 
program carries out a processing of the used bands of individual audio segments with the 
aid of suitable functions in a band in which the instance of concatenation lies, with these 
functions involving i.a. a modification of the frequency, the duration, the amplitude, or
the spectrum. 

41. (new) The data carrier according to Claim 33 characterised in that the computer 
program establishes the instance of concatenation in a place of the used bands of the first 
and/or the second audio segment, in which the two used bands are in agreement with
respect to one or several suitable properties, with these properties including i.a.: zero
points, amplitude values, gradients, derivatives of any degree, spectra, tone levels, 
amplitude values in a frequency band, volume, style of speech, emotion of speech, or 
other properties covered in the phone classification scheme. 

42. (new) The data carrier according to Claim 33 characterised in that the computer 
program carries out an implementation of heuristic knowledge which relates to the 
selection of the used bands of the individual audio segments, their processing, their 
variation, as well as their concatenation. 

43. (new) The data carrier according to Claim 33 characterised in that the computer 
program is suited for the generation of synthesised voice data, with the sounds being 
phones. 

44. (new) The data carrier according to Claim 34 characterised in that the computer 
program is suited for the generation of static phones, with the static phones comprising 
vowels, diphthongs, liquids, vibrants, fricatives, and nasals.

45. (new) The data carrier according to Claim 35 characterised in that the computer 
program is suited for the generation of dynamic phones, with the dynamic phones 
comprising plosives, affricates, glottal stops, and click sounds.

46. (new) The data carrier according to Claim 33 characterised in that the computer 
program converts the synthesised acoustical data to acoustical convertible data and/or
voice signals. 

47. (new) Synthesised voice signals which consist of a sequence of sounds or phones,
respectively, with the voice signals being generated in that: 

- at least two audio segments are selected which reproduce the sounds or phones, 
respectively; and 

- the audio segments are linked by a co-articulation-specific concatenation, with

- one band to be used of an earlier audio segment being established; 

- one band to be used of a later audio segment being established which starts with the 
later audio segment and ends with the co-articulation band of the later audio segment, 
following the initially used solo articulation band; 

- with the duration and position of the bands to be used being determined depending on 
the audio segments; and 

- the used bands of the audio segments being concatenated in a co-articulation-specific 
manner, in that the instance of concatenation, as a function of properties of the used band 
of the later audio segment, is set in a band which starts immediately before the used band 
of the later audio segment and ends with same. 

48. (new) The synthesised voice signals according to Claim 47, characterised in that the 
voice signals are generated in that 

- the audio segments are concatenated in an instance which lies in the vicinity of the 
boundaries of the later audio segment, if the start of this band reproduces a static sound 
or phone, respectively, with the static phone being a vowel, a diphthong, a liquid, a
fricative, a vibrant, or a nasal; and 

- a downstream portion of the used band of the earlier audio segment and an upstream 
portion of the used band of the later audio segment are processed by means of suitable 
transfer functions and both bands are added in an overlapping manner (cross fade), with
the transfer functions and the length of an overlapping portion of the two bands being 
determined depending on the audio segments to be concatenated. 

49. (new) The synthesised voice signals according to Claim 47 characterised in that the 
voice signals are generated in that 

- the audio segments are concatenated in an instance which lies immediately before the
used band of the later audio segment, if the start of this band reproduces a dynamic sound
or phone, respectively, with the dynamic phone being a plosive, an affricate, a glottal
stop, or a click sound; and

- a downstream portion of the used band of the earlier audio segment and an upstream 
portion of the used band of the later audio segment are processed by means of suitable 
transfer functions and both bands are joined in a non-overlapping manner (hard fade), 
with the transfer functions being determined depending on the audio segments to be 
concatenated. 

50. (new) The synthesised voice signals according to Claim 47 characterised in that 

- the first sound or the first phone, respectively, or a portion of the first phone sequence 
or of the first polyphone, respectively, in the sequence is generated by an audio segment, 
whose used band at the start reproduces the properties of the start of the sequence. 

51. (new) The synthesised voice signals according to Claim 47 characterised in that

- the last sound or the last phone, respectively, or a portion of the last phone sequence or 
of the last polyphone, respectively, in the sequence is generated by an audio segment, 
whose used band at the end reproduces the properties of the end of the sequence. 

52. (new) The synthesised voice signals according to Claim 47 characterised in that 

- the voice signals are generated in that later bands of audio segments, beginning with the 
reproduction of a dynamic sound or phone, respectively, are concatenated with earlier 
bands of audio segments, beginning with the reproduction of a static sound or phone, 
respectively. 

53. (new) The synthesised voice signals according to Claim 47 characterised in that 

- such audio segments are selected which reproduce the highest number of portions of 
sounds or phones, respectively, of the sequence, in order to use the smallest number of 
audio segment bands in the generation of the voice signals. 

54. (new) The synthesised voice signals according to Claim 47 characterised in that

- the voice signals are generated by the concatenation of the used bands of audio 
segments which are processed with the aid of suitable functions depending on properties 
of the sound sequence or phone sequence, respectively, with the functions involving i.a. 
a modification of the frequency, the duration, the amplitude, or the spectrum. 

55. (new) The synthesised voice signals according to Claim 47 characterised in that 

- the voice signals are generated by the concatenation of the used bands of audio 
segments which are processed with the aid of suitable functions depending on properties 
of the sound sequence or phone sequence, respectively, in an area in which the instance 
of concatenation lies, with these properties including i.a. a modification of the frequency,
the duration, the amplitude, or the spectrum. 

56. (new) The synthesised voice signals according to Claim 47 characterised in that the
instance of concatenation lies at a place in the used bands of the earlier and/or the later 
audio segment, in which the two used bands are in agreement with respect to one or 
several suitable properties, with these properties including i.a.: zero points, amplitude 
values, gradients, derivatives of any degree, spectra, tone levels, amplitude values in a 
frequency band, volume, style of speech, emotion of speech, or other properties covered 
in the phone classification scheme. 

57. (new) The synthesised voice signals according to Claim 47 characterised in that the 
voice signals are suited for a conversion to acoustic signals. 

58. (new) An acoustical, optical, magnetic, or electrical data storage which contains audio 
segments in order to generate synthesised acoustical data by means of a concatenation of
used bands of the audio segments, utilising the methods according to Claim 1. 

59. (new) The data storage according to Claim 58, characterised in that a group of the
audio segments reproduces sounds or phones, respectively, or portions of sounds or 
phones, respectively. 

60. (new) The data storage according to Claim 58 characterised in that a group of the 
audio segments reproduces phone sequences or portions of phone sequences or 
polyphones, respectively, or portions of polyphones. 

61. (new) The data storage according to Claim 58 characterised in that a group of audio 
segments is provided whose used bands start with a static sound or phone, respectively, 
with the static phones comprising vowels, diphthongs, liquids, fricatives, vibrants, and
nasals. 

62. (new) The data storage according to Claim 58 characterised in that audio segments 
are provided which are suitable for the conversion to acoustical signals.

63. (new) The data storage according to Claim 58 which additionally contains 
information in order to carry out a processing of the used bands of individual audio 
segments with the aid of suitable functions depending on properties of the acoustical data 
to be synthesised, with the functions involving i.a. a modification of the frequency, the 
duration, the amplitude, or the spectrum. 

64. (new) The data storage according to Claim 58 which additionally contains 
information relating to a processing of the used bands of individual audio segments with 
the aid of suitable functions in a band in which the instance of concatenation lies, with 
this function involving i.a. a modification of the frequency, the duration, the amplitude, 
or the spectrum. 

65. (new) The data storage according to Claim 58 which additionally provides linked 
audio segments, whose instance of concatenation lies at a place of the used bands of the 
earlier and/or later audio segment, where both used bands are in agreement with respect 
to one or several suitable properties with these properties being i.a.: zero points,
amplitude values, gradients, derivatives of any degree, spectra, tone levels, amplitude values
in a frequency band, volume, style of speech, emotion of speech, or other properties
covered in the phone classification scheme. 

66. (new) The data storage according to Claim 58,

which additionally contains information in the form of heuristic knowledge, which relates
to the selection of the used bands of the individual audio segments, their processing, their
variation, as well as their concatenation.
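
The distinction drawn in Claims 2 and 3 between an overlapping join (cross fade) and a
non-overlapping join (hard fade) can be illustrated by a minimal sketch. It is not part of
the application: the function names, the linear weights standing in for the "suitable
transfer functions", and the representation of a band as a list of samples are all
assumptions made for illustration.

```python
# Illustrative sketch only. The names and the linear weights are assumptions,
# not the transfer functions of the application.

def cross_fade(earlier, later, overlap):
    """Join two bands by adding their weighted ends over `overlap` samples."""
    if not 0 < overlap <= min(len(earlier), len(later)):
        raise ValueError("overlap must be positive and fit inside both bands")
    start = len(earlier) - overlap
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the later band rises linearly
        mixed.append((1.0 - w) * earlier[start + i] + w * later[i])
    return earlier[:start] + mixed + later[overlap:]


def hard_fade(earlier, later, ramp):
    """Join two bands end to end after fading one out and the other in."""
    if not 0 <= ramp <= min(len(earlier), len(later)):
        raise ValueError("ramp must fit inside both bands")
    start = len(earlier) - ramp
    faded_out = [earlier[start + i] * (1.0 - (i + 1) / (ramp + 1)) for i in range(ramp)]
    faded_in = [later[i] * ((i + 1) / (ramp + 1)) for i in range(ramp)]
    return earlier[:start] + faded_out + faded_in + later[ramp:]


def concatenate(earlier, later, later_starts_static, span):
    """Cross fade before a static phone, hard fade before a dynamic phone."""
    if later_starts_static:
        return cross_fade(earlier, later, span)
    return hard_fade(earlier, later, span)
```

In this sketch a cross fade shortens the result by the overlap length, whereas a hard
fade preserves the combined length of both bands.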

Method and Devices for the Co-articulation-specific 
Concatenation of Audio Segments 

The invention relates to a method and a device for the
concatenation of audio segments for the generation of synthesised
acoustical data, in particular synthesised speech. In
particular, the invention relates to synthesised voice signals
which have been generated by the inventive
co-articulation-specific concatenation of voice segments, as well as to a data
carrier which contains a computer program for the inventive
generation of synthesised acoustical data, in particular,
synthesised speech.

In addition, the invention relates to a data storage which
contains audio segments which are suited for the inventive
co-articulation-specific concatenation, and a sound carrier
which, according to the invention, contains synthesised
acoustical data.

It must be emphasised that both the state of the art
represented in the following, and the present invention relate to
the entire field of the synthesis of acoustical data by means
of the concatenation of individual audio segments which are
obtained in any manner. However, for the sake of simplifying
the discussion of the state of the art as well as the
description of the present invention, the following explanations
refer specifically to synthesised voice data by means of the
concatenation of individual voice segments.

In recent years, the data-based approach has prevailed over
the rule-based approach in the field of speech synthesis, and
can be found in various methods and systems for speech
synthesis. Although the rule-based approach in principle
enables a better speech synthesis, its implementation requires
that the entire knowledge needed for speech generation be
phrased explicitly, i.e. that the speech to be synthesised be
formally modelled. Because the known speech models simplify
the speech to be synthesised, the voice quality of the speech
generated in this manner is not sufficient.

For this reason, data-based speech synthesis is carried out to an
increasing extent, wherein corresponding segments are selected
from a database containing individual voice segments and linked
(concatenated) to each other. In this context, the voice quality
depends primarily on the number and type of the available voice
segments, because only that speech can be synthesised which is
reproduced by voice segments in the database. In order to
minimise the number of voice segments to be provided and
nevertheless still generate a high-quality synthesised speech,
various methods are known which carry out a linking
(concatenation) of the voice segments according to complex rules.

When using such methods or corresponding devices, respectively,
an inventory, i.e. a database comprising the voice audio
segments, can be employed which is complete and manageable. An
inventory is complete if it is capable of generating any sound
sequence of the speech to be synthesised, and it is manageable if
the number and type of the data of the inventory can be processed
in a desired manner by the technically available means.
Furthermore, such a method must ensure that the concatenation of
the individual inventory elements generates a synthesised speech
which differs as little as possible from naturally spoken speech.
To this end, a synthesised speech must be fluent and comprise the
same articulatory effects as natural speech. In this context, the
so-called co-articulatory effects, i.e. the mutual influence of
phones, are of particular importance. For this reason, the
inventory elements should be of such a nature that they consider
the co-articulation of individual successive phones. In addition,
a method for the concatenation of the inventory elements should
link the elements, even beyond word and phrase boundaries, under
consideration of the co-articulation of individual successive
phones as well as of the higher-order co-articulation of several
successive phones.



Before presenting the state of the art, a few terms from the
field of speech synthesis, which are necessary for a better
understanding, will be explained in the following:

- A phone is a class of any sound events (noises, sounds,
tones, etc.). The sound events are classified in accordance
with a classification scheme into phone classes. A sound event
belongs to a phone if the values of the sound event are within
the range of values defined for the phone with respect to the
parameters (e.g. spectrum, tone level, volume, chest or head
voice, co-articulation, resonance cavities, emotion, etc.) used
for the classification.

The classification scheme for phones depends on the type of
application. For vocal sounds (= phones), the IPA classification
is generally used. However, the definition of the term phone as
used herein is not limited to this, but any other parameters can
be used. If, for example, in addition to the IPA classification,
the tone level or the emotional expression are included as
parameters in the classification, two 'a' phones with different
tone level or different emotional expression become different
phones in the sense of the definition. Phones can, however, also
be the tones of a musical instrument, e.g. a violin, in the
different tone levels and the different modes of playing (up-bow
and down-bow, détaché, spiccato, marcato, pizzicato, col legno,
etc.). Phones can be the barking of dogs or the squealing of a
car door.

Phones can be reproduced by audio segments which contain
corresponding acoustical data.

In the description of the invention following the definitions,
the term vocal sound can invariably be replaced by the term
phone in the sense of the previous definition, and the term
phoneme can be replaced by the term phonetic character. (This
also applies the other way round, because phones are vocal
sounds classified according to the IPA classification.)

- A static phone has bands which are similar to previous or
subsequent bands of the static phone. The similarity need not
necessarily be an exact correspondence as in the periods of a
sinusoidal tone, but is analogous to the similarity as it
prevails between the bands of the static phones defined in the
following.

- A dynamic phone has no bands with a similarity with previous
or subsequent bands of the dynamic phone, such as, e.g. the
sound event of an explosion or a dynamic phone.

- A phone is a vocal sound which is generated by the organs of
speech (a vocal sound). The phones are classified into static
and dynamic phones.

- The static phones include vowels, diphthongs, nasals,
laterals, vibrants, and fricatives.

- The dynamic phones include plosives, affricates, glottal
stops, and click sounds.

- A phoneme is the formal description of a phone, with the
formal description usually being effected by phonetic
characters.

- The co-articulation refers to the phenomenon that a sound,
i.e. a phone, too, is influenced by upstream or downstream
sounds or phones, respectively, with the co-articulation
occurring both between immediately neighbouring sounds/phones
and across a sequence of several sounds/phones (for example in
rounding the lips).

A sound or phone, respectively, can therefore be classified
into three bands (see also Fig. 1b):

- The initial co-articulation band comprises the band from the
start of a sound/phone to the end of the co-articulation due
to an upstream sound/phone.

- The solo articulation band is the band of the sound/phone
which is not influenced by an upstream or downstream sound or
an upstream or downstream phone, respectively.

- The end co-articulation band comprises the band from the
start of the co-articulation due to a downstream sound/phone
to the end of the sound/phone.

- The co-articulation band comprises an end co-articulation
band and the neighbouring initial co-articulation band of the
neighbouring sound/phone.
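
The band structure defined above can be pictured as a small data
structure. The following is only an illustrative sketch and not
part of the specification: the class name `PhoneBands`, its field
names, the helper `coarticulation_band`, and the use of sample
indices as band boundaries are all assumptions made for the
example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhoneBands:
    """Hypothetical container: band boundaries of one sound/phone, in samples."""
    start: int       # start of the sound/phone
    solo_start: int  # end of the co-articulation due to the upstream phone
    solo_end: int    # start of the co-articulation due to the downstream phone
    end: int         # end of the sound/phone

    def __post_init__(self):
        if not (self.start <= self.solo_start <= self.solo_end <= self.end):
            raise ValueError("band boundaries must be ordered")

    @property
    def initial_coarticulation(self):
        return (self.start, self.solo_start)

    @property
    def solo_articulation(self):
        return (self.solo_start, self.solo_end)

    @property
    def end_coarticulation(self):
        return (self.solo_end, self.end)


def coarticulation_band(earlier, later):
    """End co-articulation band of the earlier phone plus the initial
    co-articulation band of the neighbouring later phone."""
    if earlier.end != later.start:
        raise ValueError("phones must be direct neighbours")
    return (earlier.solo_end, later.solo_start)
```

For two neighbouring phones `a` and `b`, `coarticulation_band(a, b)`
spans from the end of the solo articulation band of `a` to the
start of the solo articulation band of `b`.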

- A polyphone is a sequence of phones.

- The elements of an inventory are audio segments stored in a
coded form which reproduce sounds, portions of sounds,
sequences of sounds, or portions of sequences of sounds, or
phones, portions of phones, polyphones, or portions of
polyphones, respectively. For a better understanding of the
potential structure of an audio segment/inventory element,
reference is made to Fig. 2a which shows a conventional audio
segment, and Figs. 2b - 2l which show inventive audio segments.
In addition, it should be mentioned that audio segments can be
formed from smaller or larger audio segments which are included
in the inventory or a database. Furthermore, audio segments can
also be provided in a transformed form (e.g. in a
Fourier-transformed form) in the inventory or the database.
Audio segments for the present invention can also come from a
prior synthesis step (which is not part of the method). Audio
segments include at least a part of an initial co-articulation
band, a solo articulation band, and/or an end co-articulation
band. In lieu of audio segments, it is also possible to use
bands of audio segments.

- The term concatenation implies the joining of two audio
segments.

- The concatenation instance is the point in time at which two
audio segments are joined.

The concatenation can be effected in various ways, e.g. with a
cross fade or a hard fade (see also Figs. 3a - 3e):

- In a cross fade, a downstream band of a first audio segment
band and an upstream band of a second audio segment band are
processed by means of suitable transfer functions, and
subsequently these two bands are overlappingly added in such a
manner that at most the shorter of the two bands with respect to
time is completely overlapped by the longer one.

- In a hard fade, a later band of a first audio segment and an
earlier band of a second audio segment are processed by means
of suitable transfer functions, with the two audio segments
being joined to one another in such a manner that the later
band of the first audio segment and the earlier band of the
second audio segment do not overlap.
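
The two kinds of concatenation can be sketched on plain lists of
samples. This is a minimal illustration under stated assumptions,
not the method of the specification: the function names
`cross_fade` and `hard_fade` are hypothetical, linear fade weights
stand in for the "suitable transfer functions", and the hard fade
here applies no transfer function at all.

```python
def cross_fade(first, second, overlap):
    """Join two sample lists, overlapping the last `overlap` samples of
    `first` with the first `overlap` samples of `second`.

    Linear fade-out/fade-in weights are an assumption of this sketch.
    """
    if overlap < 0 or overlap > min(len(first), len(second)):
        raise ValueError("overlap must fit inside both bands")
    head = first[:len(first) - overlap]
    tail = second[overlap:]
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the later band, 0 < w < 1
        a = first[len(first) - overlap + i]
        b = second[i]
        mixed.append((1.0 - w) * a + w * b)
    return head + mixed + tail


def hard_fade(first, second):
    """Join two sample lists without any overlap (overlap of length zero)."""
    return cross_fade(first, second, 0)
```

With an overlap of two samples, two bands of four samples each
yield six output samples; with `hard_fade` the output is simply the
first band followed by the second.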

The co-articulation band is primarily noticeable in that a
concatenation therein is associated with discontinuities (e.g.
spectral skips).

In addition, it should be noted that, strictly speaking, a hard
fade is a boundary case of a cross fade, in which the overlap of
a later band of a first audio segment and an earlier band of a
second audio segment has a length of zero. This makes it possible
to replace a cross fade with a hard fade in certain, e.g.
extremely time-critical, applications. Such an approach must be
considered carefully, however, because it results in considerable
quality losses in the concatenation of audio segments which
actually are to be concatenated by a cross fade.

- The term prosody refers to changes in the voice frequency
and the voice rhythm which occur in spoken words or phrases,
respectively. The consideration of such prosodic information
is necessary in the speech synthesis in order to generate a
natural word or phrase melody, respectively.

From WO 95/30193 a method and a device are known for the
conversion of text to audible voice signals utilising a neural
network. For this purpose, the text to be converted to speech is
converted to a sequence of phonemes by means of a converter unit,
with information on the syntactic boundaries of the text and the
stress of the individual components of the text being
additionally generated. This information, together with the
phonemes, is transferred to a device which determines the
duration of the pronunciation of the individual phonemes in a
rule-based manner. A processor generates a suitable input for the
neural network from each individual phoneme in connection with
the corresponding syntactic and time-related information, with
said input for the neural network also comprising the
corresponding prosodic information for the entire phoneme
sequence. From the available audio segments the neural network
then selects only those segments which best reproduce the input
phonemes and links said audio segments accordingly. In this
linking operation the individual audio segments are matched, with
respect to their duration, total amplitude, and frequency, to
upstream and downstream audio segments under consideration of the
prosodic information of the speech to be synthesised, and are
connected with each other in temporal succession. A modification
of individual bands of the audio segments is not described
therein.

For the generation of the audio segments which are required for
this method, the neural network first has to be trained by
dividing naturally spoken speech into phones or phone sequences
and assigning to these phones or phone sequences corresponding
phonemes or phoneme sequences in the form of audio segments. Due
to the fact that this method provides for a modification of
individual audio segments only, but not for a modification of
individual bands of an audio segment, the neural network must be
trained with as many different phones or phone sequences as
possible for converting any text to a synthesised speech with a
natural sound. Depending on the application, this may prove to
require very high expenditures. On the other hand, an
insufficient training process of the neural network may have a
negative influence on the quality of the speech to be
synthesised. Moreover, it is not possible with the method
described therein to determine the concatenation instance of the
individual audio segments depending on upstream or downstream
audio segments, in order to perform a co-articulation-specific
concatenation.

US-5,524,172 describes a device for the generation of synthesised
speech which utilises the so-called diphone method. Here, a text
which is to be converted to synthesised speech is divided into
phoneme sequences, with corresponding prosodic information being
assigned to each phoneme sequence. From a database which contains
audio segments in the form of diphones, two diphones reproducing
the phoneme are selected for each phoneme of the sequence and
concatenated under consideration of the corresponding prosodic
information. In the concatenation the two diphones each are
weighted by means of a suitable filter, and the duration and tone
level of both diphones are modified in such a manner that upon
the linking of the diphones a synthesised phone sequence is
generated whose duration and tone level correspond to the
duration and tone level of the desired phoneme sequence. In the
concatenation the individual diphones are added in such a manner
that a later band of a first diphone and an earlier band of a
second diphone overlap, with the instance of concatenation being
generally in the area of stationary bands of the individual
diphones (see Fig. 2a). Due to the fact that a variation of the
instance of concatenation under consideration of the
co-articulation of successive audio segments (diphones) is not
intended, the quality (naturalness and audibility) of a speech
synthesised in such a manner can be negatively influenced.

A further development of the previously discussed method can
be found in EP-0,813,184 A1. In this case, too, a text to be
converted to synthesised speech is divided into individual
phonemes or phoneme sequences, and corresponding audio segments
are selected from a database and concatenated. In order to
achieve an improvement of the synthesised speech, two approaches
have been realised with this method which differ from the state
of the art discussed so far. With the use of a smoothing filter
which accounts for the lower-frequency harmonic frequency
components of an upstream and a downstream audio segment, the
transition from the upstream audio segment to the downstream
audio segment is to be optimised, in that a later band of the
upstream audio segment and an earlier band of the downstream
audio segment are tuned to each other in the frequency range. In
addition, the database provides audio segments which are slightly
different from one another but are suited for synthesising one
and the same phoneme. In this manner, the natural variation of
the speech is to be mimicked in order to achieve a higher quality
of the synthesised speech. Both the use of the smoothing filter
and the selection from a plurality of various audio segments for
the realisation of a phoneme require a high computing power of
the used system components in the implementation of this method.
Moreover, the volume of the database increases due to the
increased number of the provided audio segments. Furthermore,
this method, too, does not provide for a co-articulation-dependent
choice of the concatenation instance of individual audio
segments, which may reduce the quality of the synthesised speech.

DE 693 18 209 T2 deals with formant synthesis. According to
this document, two multi-voice phones are connected with each
other using an interpolation mechanism which is applied to a
last phoneme of an upstream phone and to a first phoneme of a
downstream phone, with the two phonemes of the two phones being
identical and being superposed to one phoneme when the phones
are connected. Upon the superposition, each of the curves
describing the two phonemes is weighted with a weighting
function. The weighting function is applied to a band of each
phoneme which begins immediately after the start of the phoneme
and ends immediately before the end of the phoneme. Thus, in the
concatenation of phones described therein, the bands of the
phonemes which form the transition between phones correspond
essentially to the respective entire phonemes. This means that
the portions of the phonemes used for concatenation invariably
comprise all three bands, i.e. the respective initial
co-articulation band, solo articulation band, and end
co-articulation band. Consequently, D1 teaches an approach how
the transitions between two phones are to be smoothed.

Moreover, according to this document the instance of the
concatenation of two phones is established in such a manner that
the last phoneme in the upstream phone and the first phoneme
in the downstream phone completely overlap.

In principle, it is to be noted that DE 689 15 353 T2 aims at
improving the tone quality, in that an approach is specified
how to design the transition between two neighbouring sampling
values. This is of particular relevance in the case of low
sampling rates.

In the speech synthesis described in this document, waveforms
are used which reproduce the phones to be concatenated. With
waveforms for upstream phones, a corresponding final sampling
value and an associated zero crossing point are established,
while with waveforms for downstream phones, a corresponding
first upper sampling value and an associated zero crossing
point are established. Depending on these established sampling
values and the associated zero crossing points, phones are
connected with each other in at most four different ways. The
number of connection types is reduced to two if the waveforms
are generated by utilising the Nyquist theorem. DE 689 15 353 T2
describes that the used band of waveforms extends between the
last sampling value of the upstream waveform and the first
sampling value of the downstream waveform. A variation of the
duration of the used bands as a function of the waveforms to be
concatenated, as is the case with the invention, is not
disclosed in D1.

In summary, it can be said that the state of the art makes it
possible to synthesise any phoneme sequences, but that the
phoneme sequences synthesised in this manner do not possess an
authentic voice quality. A synthesised phoneme sequence has an
authentic voice quality if it cannot be distinguished by a
listener from the same phoneme sequence spoken by a real speaker.

Methods are also known which use an inventory which comprises
complete words and/or phrases in authentic voice quality as
inventory elements. For the speech synthesis, these elements
are brought into a desired order, with the possibilities of
various voice sequences being limited to a high degree by the
volume of such an inventory. The synthesis of any phoneme
sequences is not possible with these methods.

It is therefore the object of the present invention to provide
a method and a corresponding device which eliminate the problems
of the state of the art and enable the generation of synthesised
acoustical data, in particular synthesised voice data, which a
listener cannot distinguish from corresponding natural acoustical
data, in particular naturally spoken speech. The acoustical data
synthesised by means of the invention, in particular synthesised
voice data, is to possess an authentic acoustical quality, in
particular an authentic voice quality.

For the solution of this object the invention provides a
method according to Claim 1, a device according to Claim 14,
synthesised voice signals according to Claim 28, a data
carrier according to Claim 39, a data storage according to
Claim 51, as well as a sound carrier according to Claim 60.

The invention therefore makes it possible to generate synthesised
acoustical data which reproduces a sequence of phones, in that in
the concatenation of audio segments, the instance of the
concatenation of two audio segments is determined depending on
properties of the audio segments to be linked, in particular the
co-articulation effects which relate to the two audio segments.
According to the present invention, the instance of concatenation
is preferably selected in the vicinity of the boundaries of the
solo articulation band. In this manner, a voice quality is
achieved which cannot be obtained with the state of the art. The
required computation power is not higher than with the state of
the art.

In order to mimic the variations which can be found in the
corresponding natural acoustical data, in the synthesis of
acoustical data, the invention provides for a different selection
of the audio segment bands as well as for different ways of the
co-articulation-specific concatenation. A higher degree of
naturalness of the synthesised acoustical data is achieved if a
later audio segment band, whose start reproduces a static phone,
is connected with an earlier audio segment band by means of a
cross fade, or if a later audio segment band, whose start
reproduces a dynamic phone, is connected with an earlier audio
segment band by means of a hard fade, respectively. In addition,
it is advantageous to generate the start of the synthesised
acoustical data to be generated by using an audio segment band
which reproduces the start of a phone sequence, or to generate
the end of the synthesised acoustical data to be generated by
using an audio segment band which reproduces the end of a phone
sequence, respectively.
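
The choice between the two kinds of concatenation described above
can be written as a simple lookup. The sketch below is
illustrative only: the function name `fade_for` and the string
labels for the phone classes are hypothetical, while the grouping
into static and dynamic phones follows the definitions given
earlier in this description.

```python
# Phone classes as grouped in the definitions; the labels are hypothetical.
STATIC_PHONES = {"vowel", "diphthong", "nasal", "lateral", "vibrant", "fricative"}
DYNAMIC_PHONES = {"plosive", "affricate", "glottal_stop", "click"}


def fade_for(later_start_class):
    """Return the kind of concatenation for joining a later band to an
    earlier one, based on the phone class reproduced at the start of
    the later band."""
    if later_start_class in STATIC_PHONES:
        return "cross_fade"
    if later_start_class in DYNAMIC_PHONES:
        return "hard_fade"
    raise ValueError(f"unknown phone class: {later_start_class!r}")
```

A later band starting with a vowel would thus be joined by a cross
fade, and one starting with a plosive by a hard fade.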

In order to carry out the generation of the synthesised
acoustical data in a simpler and faster way, the invention
makes it possible to reduce the number of audio segment bands
which are required for data synthesising, in that audio segment
bands are used which always start with the reproduction of a
dynamic phone, which makes it possible to carry out all
concatenations of these audio segment bands by means of a hard
fade. For this purpose, later audio segment bands whose starts
always reproduce a dynamic phone are connected with earlier audio
segment bands. In this manner, high-quality synthesised
acoustical data according to the invention can be generated
with low computing power (e.g. in the case of answering
machines or car navigation systems).

In addition, the invention provides for mimicking acoustical
phenomena which result because of a mutual influence of
individual segments of corresponding natural acoustical data. In
particular, it is intended here to process individual audio
segments or individual bands of the audio segments, respectively,
with the aid of suitable functions. Thus it is possible to
modify, inter alia, the frequency, the duration, the amplitude,
or the spectrum of the audio segments. If synthesised voice data
is generated by means of the invention, then preferably prosodic
information and/or higher-order co-articulation effects are taken
into consideration for the solution of this object.

The signal characteristic of synthesised acoustical data can
additionally be improved if the concatenation instance is set
in places of the individual audio segment bands to be connected
where the two used bands are in agreement with each other with
respect to one or several suitable properties. These properties
can be, inter alia: zero point, amplitude value, gradient,
derivative of any degree, spectrum, tone level, amplitude value
in a frequency band, volume, style of speech, emotion of speech,
or other properties covered in the phone classification scheme.
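
As an illustration of setting the concatenation instance where the
two bands agree, the following sketch searches for the pair of
positions at which amplitude value and gradient match best. It is
a simplified example under stated assumptions, not the procedure
of the specification: the function names `best_join` and
`join_at`, the `window` parameter, the exhaustive search, and the
simple sum of absolute differences used as a mismatch measure are
all hypothetical.

```python
def best_join(earlier, later, window):
    """Search the last `window` samples of `earlier` and the first
    `window` samples of `later` for the pair of positions whose
    amplitude and gradient agree best.

    Returns (index_in_earlier, index_in_later).
    """
    if window < 2 or window > min(len(earlier), len(later)):
        raise ValueError("window must be >= 2 and fit inside both bands")
    best = None
    for i in range(max(1, len(earlier) - window), len(earlier)):
        grad_e = earlier[i] - earlier[i - 1]      # backward difference
        for j in range(1, window):
            grad_l = later[j] - later[j - 1]
            cost = abs(earlier[i] - later[j]) + abs(grad_e - grad_l)
            if best is None or cost < best[0]:
                best = (cost, i, j)
    return best[1], best[2]


def join_at(earlier, later, i, j):
    """Keep `earlier` up to (not including) position i, then continue
    with `later` from position j."""
    return earlier[:i] + later[j:]
```

Other properties from the list above (for example spectrum or tone
level) would need their own mismatch measures. The sketch only
covers amplitude value and gradient.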

The invention further makes it possible to improve the selection
of audio segment bands for the generation of the synthesised
acoustical data, as well as to make their concatenation more
efficient, in that heuristic knowledge is used which relates
to the selection, processing, variation, and concatenation of
the audio segment bands.

In order to generate synthesised acoustical data which is
voice data which does not differ from corresponding natural
voice data, preferably audio segment bands are used which
reproduce sounds/phones or portions of sound sequences/phone
sequences.



Furthermore, the invention permits the utilisation of the
generated synthesised acoustical data, in that this data is
convertible to acoustical signals and/or voice signals, and/or
storable in a data carrier.



In addition, the invention can be used for providing synthesised
voice signals which differ from known synthesised voice signals
in that, concerning their naturalness and audibility, they do not
differ from real speech. For this purpose, audio segment bands
are concatenated in a co-articulation-specific manner, each of
which reproduces portions of the sound sequence/phone sequence of
the speech to be synthesised, in that the bands of the audio
segments to be used as well as the instance of the concatenation
of these bands are established according to the invention as
defined in Claim 28.

A further improvement of the synthesised speech can be achieved
if a later audio segment band whose start reproduces a static
phone is connected with an earlier audio segment band by means of
a cross fade, or if a later audio segment band whose start
reproduces a dynamic phone, respectively, is connected with an
earlier audio segment band by means of a hard fade. Herein,
static phones comprise vowels, diphthongs, liquids, fricatives,
vibrants, and nasals, and dynamic phones comprise plosives,
affricates, glottal stops, and click sounds.

Due to the fact that the start and end stresses of phones in
natural speech differ from comparable, but embedded phones, it
is to be preferred to use corresponding audio segment bands
whose starts reproduce the start of the speech to be synthesised
and whose ends reproduce the end of same, respectively.

In particular in the generation of synthesised speech, a fast
and efficient procedure is desirable. For this purpose, it is
to be preferred to carry out the inventive
co-articulation-specific concatenation invariably by means of
hard fades, with only such audio segment bands being used whose
starts always reproduce a dynamic sound or phone, respectively.
Such audio segment bands can be generated in advance according to
the invention by means of the co-articulation-specific
concatenation of corresponding audio segment bands.

In addition, the invention provides voice signals which have a
natural flow of speech, speech melody, and speech rhythm, in
that audio segment bands are processed before and/or after the
concatenation in their entirety or in individual bands by
means of suitable functions. It is particularly advantageous
to perform this variation additionally in areas in which the
corresponding instances of concatenation are set, in order to
change, inter alia, the frequency, duration, amplitude, or
spectrum.

A still further improved signal characteristic can be achieved
if the concatenation instances are set in places of the audio
segment bands to be linked where these are in agreement with
respect to one or several properties.

In order to permit a simple utilisation and/or further
processing of the inventive voice signals by means of known
methods or devices, such as a CD player, it is to be preferred
in particular that the voice signals are convertible to
acoustical signals or are storable in a data carrier.

For the purpose of applying the invention also to known devices
such as a personal computer or a computer-controlled musical
instrument, a data carrier is provided which contains a computer
program which enables the performance of the inventive method or
the control of the inventive device and its various embodiments,
respectively. In addition, the inventive data carrier also
permits the generation of voice signals which comprise
co-articulation-specific concatenations.

For providing an inventory comprising audio segments, by means
of which synthesised acoustical data, in particular synthesised
voice data, can be generated which does not differ from
corresponding natural acoustical data, the invention provides
a data storage which includes audio segments which are suited
for being inventively concatenated to synthesised acoustical
data. Preferably, such a data carrier includes audio segments
which are suited for the performance of the inventive method,
for application in the inventive device, or the inventive data
carrier. Alternatively, the data carrier can also include
inventive voice signals.

In addition, the invention makes it possible to provide inventive
synthesised acoustical data, in particular synthesised voice
data, which can be utilised with conventional devices, e.g. a
tape recorder, a CD player, or a PC audio card. For this purpose,
a sound carrier is provided which comprises data which at least
partially has been generated by the inventive method or by means
of the inventive device or by using the inventive data carrier or
the inventive data storage, respectively. The sound carrier may
also comprise data which are the inventively
co-articulation-specific concatenated voice signals.

Further properties, characteristics, advantages, or modifications
of the invention will be explained with reference to the
following description, in which:

Fig. 1a is a schematic representation of an inventive device
for the generation of synthesised acoustical data;
Fig. 1b shows the structure of a sound/phone;
Fig. 2a shows the structure of a conventional audio segment
according to the state of the art, consisting of portions of
two phones, i.e. a diphone for voice. It is essential that the
solo articulation bands each are included only partially in
the conventional diphone audio segment.
Fig. 2b shows the structure of an inventive audio segment
which reproduces portions of a sound/phone with downstream
co-articulation bands (for voice a quasi 'displaced' diphone);
Fig. 2c shows the structure of an inventive audio segment
which reproduces portions of a sound/phone with upstream
co-articulation bands;
Fig. 2d shows the structure of an inventive audio segment
which reproduces portions of a sound/phone with downstream
co-articulation bands and includes additional bands;
Fig. 2e shows the structure of an inventive audio segment
which reproduces portions of a sound/phone with upstream
co-articulation bands and includes additional bands;
Fig. 2f shows the structure of an inventive audio segment
which reproduces portions of several sounds/phones (for
speech: a polyphone) with downstream co-articulation bands
each. The sounds/phones 2 to (n-1) each are completely
included in the audio segment.
Fig. 2g shows the structure of an inventive audio segment
which reproduces portions of several sounds/phones (for
speech: a polyphone) with upstream co-articulation bands each.
The sounds/phones 2 to (n-1) each are completely included in
the audio segment.
Fig. 2h shows the structure of an inventive audio segment
which reproduces portions of several sounds/phones (for
speech: a polyphone) with downstream co-articulation bands
each and includes additional bands. The sounds/phones 2 to
(n-1) each are completely included in the audio segment.
Fig. 2i shows the structure of an inventive audio segment
which reproduces portions of several sounds/phones (for
speech: a polyphone) with downstream co-articulation bands
each and includes additional bands. The sounds/phones 2 to
(n-1) each are completely included in the audio segment.

Fig. 2j shows the structure of an inventive audio segment 
which reproduces a portion of a sound/phone of the start of a 
sound sequence/phone sequence; 

Fig. 2k shows the structure of an inventive audio segment 
15 which reproduces portions of sounds/phones of the start of a 

sound sequence/phone sequence; 

Fig. 21 shows the structure of an inventive audio segment 
which reproduces a sound/phone of the end of a sound sequence 
/phone sequence; 

Fig. 3a shows the concatenation according to the state of the
art by means of an example of two conventional audio segments.
The segments begin and end with portions of the solo
articulation bands (generally half of same).

Fig. 3aI shows the concatenation according to the state of the
art. The solo articulation band of the middle phone comes from
two different audio segments.

Fig. 3b shows the concatenation according to the inventive
method by means of an example of two audio segments, each of
which contains a sound/phone with downstream co-articulation
bands. Both sounds/phones come from the centre of a phone unit
sequence.

Fig. 3bI shows the concatenation of these audio segments by
means of a cross fade. The solo articulation band comes from
one audio segment. The transition between the audio segments
is effected between two bands and is therefore less
susceptible to variations (in spectrum, frequency, amplitude,
etc.). The audio segments can also be processed by means of
additional transfer functions prior to the concatenation.

Fig. 3bII shows the concatenation of these audio segments by
means of a hard fade;

Fig. 3c shows the concatenation according to the inventive
method by means of an example of two inventive audio segments,
each of which contains a sound/phone with downstream
co-articulation bands, with the first audio segment coming
from the start of a phone sequence.

Fig. 3cI shows the concatenation of these audio segments by
means of a cross fade;

Fig. 3cII shows the concatenation of these audio segments by
means of a hard fade;

Fig. 3d shows the concatenation according to the inventive
method by means of an example of two inventive audio segments,
each of which contains a sound/phone with upstream
co-articulation bands. Both audio segments come from the
centre of a phone sequence.

Fig. 3dI shows the concatenation of these audio segments by
means of a cross fade. The solo articulation band comes from
one audio segment.

Fig. 3dII shows the concatenation of these audio segments by
means of a hard fade;

Fig. 3e shows the concatenation according to the inventive
method by means of an example of two inventive audio segments,
each of which contains a sound/phone with downstream
co-articulation bands, with the last audio segment coming from
the end of a phone sequence;

Fig. 3eI shows the concatenation of these audio segments by
means of a cross fade;

Fig. 3eII shows the concatenation of these audio segments by
means of a hard fade;

Fig. 4 is a schematic representation of the steps of the
inventive method for the generation of synthesised acoustical
data.

The reference numerals used in the following refer to Fig. 1a,
and the numbers of the various steps of the method used in the
following refer to Fig. 4.

In order to convert, for example, a text to synthesised speech
by means of the invention, it is necessary to divide this text
in a preparatory step into a sequence of phonetic characters
or phonemes, respectively. Preferably, prosodic information
corresponding to the text is to be generated as well. The
sound or phone sequence, respectively, as well as the prosodic
and additional information serve as input values for the
inventive method or the inventive device, respectively.

The sounds/phones to be synthesised are supplied to an input
unit 101 of the device 1 for the generation of synthesised
voice data and stored in a first memory unit 103 (see Fig.
1a). By means of a selection means 105, audio segments are
selected from an inventory of audio segments (elements) which
is stored in a database 107, or are supplied by an upstream
synthesis means 108 (which is not part of the invention). The
selected audio segments reproduce sounds or phones,
respectively, or portions of sounds or phones, respectively,
which correspond to the individually input phonetic characters
or phonemes, respectively, or portions of same, and are stored
in a second memory unit 109 in an order corresponding to the
order of the input phonetic characters or phonemes,
respectively. If the inventory includes portions of phone
sequences or of audio segments, the selection means 105
preferably selects those audio segments which reproduce the
highest number of portions of the phone sequences or
polyphones, respectively, which correspond to a sequence of
phonetic characters or phonemes, respectively, from the input
phone sequence or phoneme sequence, respectively, so that a
minimum number of audio segments is required for the synthesis
of the input phoneme sequence.
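
The selection preference described above can be illustrated by
a minimal sketch. It assumes, purely for illustration, that the
input is a list of phone symbols and that the inventory is a set
of tuples naming the phone sequence each stored segment
reproduces; the function name select_segments is hypothetical.
The sketch uses a greedy longest-match strategy, which favours
long segments but is not guaranteed to find the true minimum
number of segments in every case.

```python
def select_segments(phones, inventory):
    """Greedy longest-match cover of `phones` using `inventory`.

    `phones` is a list of phone symbols; `inventory` is a set of
    tuples, each naming the phone sequence one segment reproduces.
    Both representations are assumptions made for this sketch.
    """
    longest = max((len(unit) for unit in inventory), default=0)
    selected = []
    i = 0
    while i < len(phones):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(longest, len(phones) - i), 0, -1):
            unit = tuple(phones[i:i + n])
            if unit in inventory:
                selected.append(unit)
                i += n
                break
        else:
            # No inventory element starts at this position.
            raise ValueError(f"no segment covers phone {phones[i]!r}")
    return selected


inventory = {("h",), ("a",), ("l",), ("o",), ("h", "a"), ("l", "o")}
print(select_segments(["h", "a", "l", "o"], inventory))
```

With this hypothetical inventory, the four input phones are
covered by two two-phone segments rather than four single-phone
segments.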

If the database 107 or the upstream synthesis means 108
provides an inventory with audio segments of different types,
the selection means 105 preferably selects the longest audio
segment bands which reproduce portions of the sound sequence/
phone sequence, in order to synthesise the input sound
sequence or phone sequence, respectively, and/or a sequence of
sounds/phones from a minimum number of audio segment bands. In
this context, it is advantageous to use audio segment bands
reproducing linked sounds/phones, which reproduce an earlier
static sound/phone and a later dynamic sound/phone. In this
manner, audio segments are generated which, because of the
embedded dynamic sounds/phones, invariably begin with a static
sound/phone. For this reason, the concatenation procedure for
such audio segments is simplified and standardised, because
only cross fades are required for this.

In order to achieve a co-articulation-specific concatenation
of the audio segment bands to be linked, the concatenation
instances of two successive audio segment bands are
established with the aid of a concatenation means 111 as
follows:

- If an audio segment band is to be used for synthesising the
start of the input sound sequence/phone sequence (step 1), an
audio segment band is to be selected from the inventory which
reproduces the start of a sound sequence/phone sequence, and
is to be linked with a later audio segment band (see Fig. 3c
and step 3 in Fig. 4).

- In the concatenation of a second audio segment band with an
earlier first audio segment band, a distinction must be made
as to whether the second audio segment band starts with the
reproduction of a static sound/phone or a dynamic sound/phone
in order to appropriately make the selection of the instance
of concatenation (step 6).

- If the second audio segment band starts with a static sound/
phone, then the concatenation is carried out in the form of a
cross fade, with the instance of concatenation being set in
the downstream portion of the first audio segment band and in
the upstream portion of the second audio segment band, with
the two bands overlapping in the concatenation or at least
bordering on one another (see Figs. 3bI, 3cI, 3dI, and 3eI;
concatenation by means of cross fade).

- If the second audio segment band starts with a dynamic sound/
phone, then the concatenation is carried out in the form of a
hard fade, with the instance of concatenation being set
immediately after the downstream portion of the first audio
segment band and immediately before the upstream band of the
second audio segment band (see Figs. 3bII, 3cII, 3dII, and
3eII; concatenation by means of hard fade).
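
The static/dynamic distinction in the list above amounts to a
simple decision rule, sketched here. The phone classes are taken
from Claims 13 and 14; the string labels and the function name
concatenation_type are assumptions introduced only for this
illustration.

```python
# Phone classes as listed in Claims 13 and 14; the label spellings
# are assumptions of this sketch.
STATIC_CLASSES = {"vowel", "diphthong", "liquid", "vibrant",
                  "fricative", "nasal"}
DYNAMIC_CLASSES = {"plosive", "affricate", "glottal_stop", "click"}


def concatenation_type(first_phone_class):
    """Return the fade type for a later band, given the class of
    the sound/phone with which that band starts."""
    if first_phone_class in STATIC_CLASSES:
        return "cross_fade"   # bands overlap (or at least border)
    if first_phone_class in DYNAMIC_CLASSES:
        return "hard_fade"    # bands are joined without overlap
    raise ValueError(f"unknown phone class: {first_phone_class!r}")


print(concatenation_type("vowel"))    # cross_fade
print(concatenation_type("plosive"))  # hard_fade
```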

In this manner, new audio segments can be generated from the
originally available audio segment bands, which start with the
reproduction of a static sound/phone. This is achieved in that
audio segment bands which start with the reproduction of a
dynamic sound/phone are linked later with audio segment bands
which start with the reproduction of a static sound/phone.
Though this increases the number of audio segments or the
volume of the inventory, respectively, it can nevertheless be
a computational advantage, because fewer individual
concatenations are required for the generation of a phone
sequence/phoneme sequence, and concatenations have to be
carried out only in the form of cross fades. Preferably, the
new linked audio segments are supplied to the database 107 or
another memory unit 113.

A further advantage of this linking of the original audio
segment bands to new longer audio segments results if, for
example, a sequence of sounds/phones frequently repeats itself
in the input sound sequence/phone sequence. It is then
possible to utilise one of the new correspondingly linked
audio segments, and it is not necessary to carry out another
concatenation of the originally available audio segment bands
with each occurrence of this sequence of sounds/phones.
Preferably, overlapping co-articulation effects, too, are to
be covered, or specific co-articulation effects in the form of
additional data are to be assigned to the stored linked audio
segment, respectively, when storing such linked audio
segments.

If an audio segment band is to be used for synthesising the
end of the input sound sequence/phone sequence, an audio
segment band is to be selected from the inventory which
reproduces an end of a sound sequence/phone sequence, and is
to be linked with an earlier audio segment band (see Fig. 3e
and step 8 in Fig. 4).

The individual audio segments are stored in a coded form in
the database 107, with the coded form of the audio segments,
apart from the waveform of the respective audio segment, being
able to indicate which type of concatenation (e.g. hard fade,
linear or exponential cross fade) is to be carried out with
which later audio segment band, and at which instance the
concatenation takes place with which later audio segment band.
Preferably, the coded form of the audio segments also includes
information with respect to the prosody, higher-order
co-articulations and transfer functions which are used to
achieve an additional improvement of the voice quality.
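
One possible in-memory layout for such a coded audio segment is
sketched below. The description does not prescribe any concrete
format, so the class name CodedSegment, all field names and the
use of sample indices for the concatenation instance are
assumptions made for illustration only.

```python
from dataclasses import dataclass, field

# Concatenation types named in the description; the string
# spellings are assumptions of this sketch.
CONCAT_TYPES = ("hard_fade", "linear_cross_fade",
                "exponential_cross_fade")


@dataclass
class CodedSegment:
    phones: tuple          # sounds/phones the segment reproduces
    waveform: list         # sample values of the segment
    concat_type: str       # one of CONCAT_TYPES
    concat_instant: int    # sample index of the concatenation instance
    overlap_length: int = 0                     # samples; 0 for a hard fade
    extra: dict = field(default_factory=dict)   # prosody, higher-order
                                                # co-articulation, etc.

    def __post_init__(self):
        if self.concat_type not in CONCAT_TYPES:
            raise ValueError(f"unknown concatenation type: "
                             f"{self.concat_type!r}")


segment = CodedSegment(phones=("a",), waveform=[0.0, 0.2, 0.1],
                       concat_type="linear_cross_fade",
                       concat_instant=1, overlap_length=2)
print(segment.concat_type, segment.overlap_length)
```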

In the selection of the audio segment bands for synthesising
the input sound sequence/phone sequence, the audio segment
bands selected as the later ones are such that they correspond
to the properties of the respective earlier audio segment
bands, i.a. type of concatenation and concatenation instance.
After the selection of the audio segment bands, each of which
reproduces portions of the sound sequence/phone sequence,
from the database 107 or the upstream synthesising means 108,
the concatenation of two successive audio segment bands by
means of the concatenation means 111 is carried out as
follows. The waveform, the type of concatenation, the
concatenation instance as well as any additional information,
if required, of the first audio segment band and the second
audio segment band are loaded from the database or the
synthesising means (Fig. 3b and steps 10 and 11). Preferably,
such audio segment bands are selected in the above-mentioned
selection of the audio segment bands which are in agreement
with each other with respect to their type and instance of
concatenation. In this case, loading of information with
respect to type and instance of concatenation of the second
audio segment band is no longer necessary.

For the concatenation of the two audio segment bands, the
waveform of the first audio segment band in a later band and
the waveform of the second audio segment band in an earlier
band are each processed by means of suitable transfer
functions, e.g. multiplied by a suitable weighting function
(see Fig. 3b, steps 12 and 13). The lengths of the later band
of the first audio segment and of the earlier band of the
second audio segment result from the type of concatenation and
the time position of the concatenation instance; these lengths
can also be stored in the coded form of the audio segments in
the database.

If the two audio segment bands are to be linked by means of a
cross fade, they are added in an overlapping manner according
to the respective instance of concatenation (see Figs. 3bI,
3cI, 3dI, and 3eI; step 15). Preferably, a linear symmetrical
cross fade is to be used here; however, any other type of
cross fade or any type of transfer function can be employed as
well. If a concatenation in the form of a hard fade is to be
carried out, the two audio segment bands are joined
consecutively in a non-overlapping manner (see Figs. 3bII,
3cII, 3dII, and 3eII; step 15). As can be seen in Fig. 3bII,
the two audio segment bands are arranged immediately
successive in time. In order to be able to further process the
voice data generated in this manner, it is preferably stored
in a third memory unit 115.
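
The two kinds of join described above can be sketched on plain
lists of samples. The function names, the list representation
and the exact shape of the linear weights are assumptions of
this illustration; the description itself leaves the transfer
functions open.

```python
def hard_fade(first, second):
    """Join two bands immediately one after the other, no overlap."""
    return list(first) + list(second)


def linear_cross_fade(first, second, overlap):
    """Overlap-add the last `overlap` samples of `first` with the
    first `overlap` samples of `second` using linear weights that
    sum to one at every sample (a symmetrical linear cross fade)."""
    if not 0 < overlap <= min(len(first), len(second)):
        raise ValueError("overlap must fit inside both bands")
    start = len(first) - overlap
    mixed = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)   # weight of the later band, rising
        mixed.append((1.0 - w) * first[start + k] + w * second[k])
    return list(first[:start]) + mixed + list(second[overlap:])


a = [1.0, 1.0, 1.0, 1.0]
b = [0.0, 0.0, 0.0, 0.0]
print(len(hard_fade(a, b)))             # 8 samples, no overlap
print(len(linear_cross_fade(a, b, 2)))  # 6 samples, 2 overlapped
```

Because the weights sum to one, two bands of equal constant
level keep that level across the overlap.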

For the further linking with successive audio segment bands,
the audio segment bands linked so far are considered as a
first audio segment band (step 16), and the above described
linking process is repeated until the entire sound sequence/
phone sequence has been synthesised.
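
This repeated linking is a left fold over the selected bands,
sketched below. The helper name link_all and the `join` callable
(standing for whichever cross fade or hard fade applies to a
given pair) are hypothetical and serve only as an illustration.

```python
from functools import reduce


def link_all(bands, join):
    """Link bands left to right: the result so far is treated as
    the 'first' band and each following band as the 'second'."""
    if not bands:
        return []
    return reduce(join, bands[1:], list(bands[0]))


# Example with a plain non-overlapping join as the `join` callable.
print(link_all([[1, 2], [3], [4, 5]], lambda a, b: a + list(b)))
```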

For an improvement of the quality of the synthesised voice
data, the prosodic and additional information which is input
in addition to the sound sequence/phone sequence is preferably
to be considered in the linking of the audio segment bands. By
means of known methods, the frequency, duration, amplitude,
and/or spectral properties of the audio segment bands can be
modified before and/or after the concatenation in such a
manner that the synthesised voice data comprises a natural
word and/or phrase melody (steps 14, 17, or 18). In this
context it is to be preferred to select concatenation
instances at places of the audio segment bands, at which they
agree in one or several suitable properties.
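
As one illustration of such an agreeing property, the zero point
named in Claim 10 can be used: the concatenation instance is
moved to the nearest rising zero crossing of a band. The
function names and the restriction to rising crossings are
assumptions of this sketch, not something the description
prescribes.

```python
def rising_zero_crossings(samples):
    """Indices i at which the signal passes from negative to
    non-negative between sample i-1 and sample i."""
    return [i for i in range(1, len(samples))
            if samples[i - 1] < 0 <= samples[i]]


def nearest_zero_crossing(samples, target):
    """Return the rising zero crossing closest to index `target`,
    or `target` itself if the band contains none."""
    candidates = rising_zero_crossings(samples)
    if not candidates:
        return target
    return min(candidates, key=lambda i: abs(i - target))


band = [1.0, -1.0, -0.5, 0.5, 1.0, -1.0, 2.0]
print(nearest_zero_crossing(band, 5))
```

Placing the instance at matching zero crossings in both bands
is one way to reduce amplitude steps at the join.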

In order to optimise the transitions between two successive
audio segment bands, the processing of the two audio segment
bands by means of suitable functions in the area of the
concatenation instance is additionally provided, in order to
tune, i.a., the frequencies, durations, amplitudes, and
spectral properties. The invention additionally permits
higher-order acoustical phenomena of real speech, such as
higher-order co-articulation effects or style of speech (i.a.
whispering, stress, singing voice, falsetto, emotional
expression), to be taken into consideration in the
synthesising of the sound sequence/phone sequence. For this
purpose, information relating to such higher-order phenomena
is additionally stored in a coded form with the corresponding
audio segment bands, so that only such audio segment bands are
selected which correspond to the higher-order co-articulation
properties of the earlier and/or later audio segment bands.

The synthesised voice data generated in this manner preferably
have a form which, with the aid of an output means 117, allows
the voice data to be converted to acoustical voice signals and
the voice data and/or voice signals to be stored on an
acoustical, optical, magnetic, or electrical data carrier
(step 19).

Generally, inventory elements are generated via the recording 
of actually spoken speech. Depending on the level of training 
of the inventory-building speaker, i.e. his or her capability 
for controlling the speech to be recorded (e.g. to control the 
tone level of the speech or to speak exactly on one tone 
level), it is possible to generate identical or similar in- 
ventory elements which have displaced boundaries between the 
solo articulation bands and the co-articulation bands. This 
results in considerably more possibilities of setting the 
concatenation points in different places. As a consequence, 
the quality of a speech to be synthesised can be considerably 
enhanced. 

This invention allows, for the first time, synthesised voice
signals to be generated by means of a co-articulation-specific
concatenation of individual audio segment bands, because the
instance of concatenation is selected depending on the
respective audio segment bands to be linked. In this manner, a
synthesised speech can be generated which is no longer
distinguishable from a naturally spoken speech. Contrary to
known methods or devices, the audio segments used herein are
not generated by speaking or recording, respectively, complete
words, in order to ensure an authentic voice quality. It is
therefore possible by means of this invention to generate
synthesised speech of any contents with the quality of an
actually spoken speech.

Although this invention is described by way of the example of
speech synthesis, it is not limited to the field of
synthesised speech, but can be used for synthesising any
acoustical data or any sound events, respectively. This
invention can therefore be employed for the generation and/or
provision of synthesised voice data and/or voice signals for
any language or dialect, as well as for the synthesis of music.



09/763149 



JC02 Rec'd PCT/PTO 16 FEB 2001



Claims 

1. A method for the co-articulation-specific concatenation 
of audio segments, in order to generate synthesised acoustical 
data which reproduces a sequence of concatenated sounds/ 
phones, comprising the following steps: 

- selecting at least two audio segments which contain bands,
each of which reproduces a portion of a sound/phone or a
portion of a sound/phone sequence;

- establishing a band to be used of an earlier audio segment; 

- establishing a band to be used of a later audio segment, 
which begins with the later audio segment and ends with the 
co-articulation band of the later audio segment which follows 
the initially used solo articulation band; 

- with the duration and position of the bands to be used being 
determined as a function of the earlier and later audio seg- 
ments; and 

- concatenating the established band of the earlier audio seg- 
ment with the established band of the later audio segment, in 
that the instance of concatenation, as a function of proper- 
ties of the used band of the later audio segment, is set in a 
band which begins immediately before the used band of the 
later audio segment and ends with same. 

2. The method according to Claim 1, characterised in that 

- the instance of concatenation is set in a band which lies in 
the vicinity of the boundaries of the initially to be used 
solo articulation band of the later audio segment, if the band 
of same to be used reproduces a static sound/phone at the be- 
ginning; and 

- a downstream portion of the band to be used of the earlier
audio segment and an upstream portion of the band to be used
of the later audio segment are processed by means of suitable
transfer functions and added in an overlapping manner (cross
fade), with the transfer functions and the length of an
overlapping portion of the two bands being determined
depending on the audio segments to be concatenated.

3. The method according to Claim 1 or 2, characterised in
that

- the instance of concatenation is set in a band which lies 
immediately before the band to be used of the later audio 
segment, if the used band of same reproduces a dynamic sound/ 
phone at the beginning; and 

- a downstream portion of the band to be used of the earlier
audio segment and an upstream portion of the band to be used
of the later audio segment are processed by means of suitable
transfer functions and joined in a non-overlapping manner
(hard fade), with the transfer functions being determined
depending on the acoustical data to be synthesised.

4. The method according to one of Claims 1 to 3, character-
ised in that for a sound/phone or a portion of the sequence of
concatenated sounds/phones at the start of the concatenated 
sound/phone sequence a band of an audio segment is selected so 
that the start of the band reproduces the properties of the 
start of the concatenated sound/phone sequence. 

5. The method according to one of Claims 1 to 4, character- 
ised in that for a sound/phone or a portion of the sequence of 
concatenated sounds/phones at the end of the concatenated 
sound/phone sequence a band of an audio segment is selected so 
that the end of the band reproduces the properties of the end 
of the concatenated sound/phone sequence. 

6. The method according to one of Claims 1 to 5, character-
ised in that the voice data to be synthesised is combined in
groups, each of which is described by an individual audio
segment.

7. The method according to one of Claims 1 to 6, character- 
ised in that an audio segment is selected for the later audio 
segment band, which reproduces the highest number of success- 
ive portions of the sounds/phones of the sound/phone sequence, 
in order to use the smallest number of audio segment bands in 
the generation of the synthesised acoustical data. 

8. The method according to one of Claims 1 to 7, character- 
ised in that a processing of the used bands of individual 
audio segments is carried out by means of suitable functions 
depending on properties of the concatenated sound/phone 
sequence, with these properties involving i.a. a modification 
of the frequency, the duration, the amplitude, or the spec- 
trum. 

9. The method according to one of Claims 1 to 8, character-
ised in that a processing of the used bands of individual
audio segments is carried out by means of suitable functions 
in a band, in which the instance of concatenation lies, with 
these functions involving i.a. a modification of the frequen- 
cy, the duration, the amplitude, or the spectrum. 

10. The method according to one of Claims 1 to 9, character- 
ised in that the instance of concatenation is set in places of 
the bands to be used of the earlier and/or later audio seg- 
ment, in which the two used bands are in agreement with re- 
spect to one or several suitable properties, with these pro- 
perties including i.a.: zero point, amplitude values, gradi- 
ents, derivatives of any degree, spectra, tone levels, amplit- 
ude values within a frequency band, volume, style of speech, 
emotion of speech, or other properties covered in the phone 
classification scheme.

11. The method according to one of Claims 1 to 10, character- 
ised in that 



- the selection of the used bands of individual audio seg- 
ments, their processing, their variation, as well as their 
concatenation are additionally carried out with the applica- 
tion of heuristic knowledge which is obtained by an addi- 
tionally carried out heuristic method. 

12. The method according to one of Claims 1 to 11, character- 
ised in that 

- the acoustical data to be synthesised is voice data, and the 
sounds are phones.

13. The method according to one of Claims 2 to 12, character- 
ised in that 

- the static phones include vowels, diphthongs, liquids,
vibrants, fricatives, and nasals.

14. The method according to one of Claims 3 to 13, character-
ised in that

- the dynamic phones include plosives, affricates, glottal 
stops, and click sounds. 

15. The method according to one of Claims 1 to 14, character- 
ised in that 

- a conversion of the synthesised acoustical data to acous- 
tical signals and/or voice signals is carried out. 

16. A device for the co-articulation-specific concatenation 
of audio segments, in order to generate synthesised acoustical 
data which reproduces a sequence of phones, comprising: 

- a database (107) in which audio segments are stored, each of
which reproduces a portion of a phone or portions of a
sequence of (concatenated) phones;

- and/or any upstream synthesis means (108) which supplies 
audio segments; 



- a means (105) for the selection of at least two audio seg- 
ments from the database (107) and/or the upstream synthesis 
means (108); and

- a means (111) for the concatenation of audio segments, 
characterised in that the concatenation means (111) is suited 
for 

- defining a band to be used of an earlier audio segment; 

- defining a portion to be used of a later audio segment in a 
band which starts with the later audio segment and ends after 
a co-articulation band of the later audio segment, which 
follows after the initially used solo articulation band; 

- determining the duration and position of the used bands de- 
pending on the earlier and later audio segments; and 

- concatenating the used band of the earlier audio segment 
with the used band of the later audio segment by defining the 
instance of concatenation as a function of properties of the 
used band of the later audio segment in a band which starts 
immediately before the used band of the later audio segment 
and ends with same. 

17. The device according to Claim 16, characterised in that 
the concatenation means (111) comprises: 

- means for the concatenation of the used band of the earlier 
audio segment with the used band of the later audio segment, 
whose used band reproduces a static phone at the beginning in 
the vicinity of the boundaries of the initially occurring solo 
articulation band of the used band of the later audio segment; 

- means for processing a downstream portion of the used band
of the earlier audio segment and an upstream portion of the
used band of the later audio segment by suitable transfer
functions; and

- means for the overlapping addition of the two bands in an
overlapping portion (cross fade), which depends on the audio
segments to be concatenated, with the transfer functions and
the length of an overlapping portion of the two bands being
determined depending on the acoustical data to be synthesised.

18. The device according to Claim 16 or 17, characterised in
that the concatenation means (111) comprises:

- means for the concatenation of the used band of the earlier 
audio segment with the used band of the later audio segment, 
whose used band reproduces a dynamic phone at the beginning, 
immediately before the used band of the later audio segment; 

- means for processing a downstream portion of the used band 
of the earlier audio segment and an upstream portion of the 
used band of the later audio segment by suitable transfer 
functions, with the transfer functions being determined de- 
pending on the acoustical data to be synthesised; and 

- means for the non-overlapping joining of the two audio
segments.

19. The device according to one of Claims 16 to 18, charac- 
terised in that the database (107) includes audio segments or 
the upstream synthesis means (108) supplies audio segments 
which comprise bands which at the start reproduce a phone or a 
portion of the concatenated phone sequence at the start of the 
concatenated phone sequence. 

20. The device according to one of Claims 16 to 19, charac- 
terised in that the database (107) includes audio segments or 
the upstream synthesis means (108) supplies audio segments 
which comprise bands, whose ends reproduce a phone or a por- 
tion of the concatenated phone sequence at the end of the 
concatenated phone sequence.

21. The device according to one of Claims 16 to 19, charac-
terised in that the database (107) includes a group of audio
segments or the upstream synthesis means (108) supplies audio
segments which comprise bands, whose starts each reproduce
only a static phone.

22. The device according to one of Claims 16 to 21, charac- 
terised in that the concatenation means (111) comprises: 

- means for the generation of further audio segments by con- 
catenation of audio segments, with the starts of the bands 
each reproducing a static phone, each with a band of a later 
audio segment whose used band reproduces a dynamic phone at 
the start, and 

- a means which supplies the further audio segments to the 
database (107) or the selection means (105).

23. The device according to one of Claims 16 to 22, charac-
terised in that, in the selection of the audio segment bands
from the database (107) or the upstream synthesis means (108),
the selection means (105) is suited to select the audio
segments which reproduce the greatest number of successive
portions of concatenated phones of the concatenated phone
sequence.

24. The device according to one of Claims 16 to 23, charac- 
terised in that the concatenation means (111) comprises means 
for processing the used bands of individual audio segments 
with the aid of suitable functions, depending on properties of 
the concatenated phone sequence, with the functions involving 
among others a modification of the frequency, the duration, 
the amplitude, or the spectrum. 

25. The device according to one of Claims 16 to 24, charac- 
terised in that 

- the concatenation means (111) comprises means for processing
the used bands of individual audio segments with the aid of
suitable functions in a band including the instance of
concatenation, with this function involving i.a. a
modification of the frequency, the duration, the amplitude, or
the spectrum.

26. The device according to one of Claims 16 to 25, charac-
terised in that

- the concatenation means (111) comprises means for the selec-
tion of the instance of concatenation in a place in the used
bands of the earlier and/or the later audio segment, in which
the two used bands are in agreement with respect to one or
several suitable properties, with these properties including
i.a.: zero points, amplitude values, gradients, derivatives of
any degree, spectra, tone levels, amplitude values in a fre-
quency band, volume, style of speech, emotion of speech, or
other properties covered in the phone classification scheme.

27. The device according to one of Claims 16 to 26, charac-
terised in that

- the selection means (105) comprises means for the implement-
ation of heuristic knowledge which relates to the selection of
the used bands of the individual audio segments, their pro-
cessing, their variation, as well as their concatenation.

28. The device according to one of Claims 16 to 27, charac- 
terised in that 

- the database (107) includes audio segments or the upstream
synthesis means (108) supplies audio segments which include
bands, each of which reproducing at least a portion of a sound
or phone, respectively, a sound or phone, respectively, por-
tions of phone sequences or polyphones, respectively, or sound
sequences or polyphones, respectively.

29. The device according to one of Claims 17 to 28,
characterised in that

- the database (107) includes audio segments or the upstream
synthesis means (108) supplies audio segments, with a static
sound corresponding to a static phone and comprising vowels,
diphthongs, liquids, vibrants, fricatives, and nasals.

30. The device according to one of Claims 18 to 29,
characterised in that

- the database (107) includes audio segments or the upstream
synthesis means (108) supplies audio segments, with a dynamic
sound corresponding to a dynamic phone and comprising
plosives, affricates, glottal stops, and click sounds.

31. The device according to one of Claims 16 to 30,
characterised in that

- the concatenation means (111) is suitable to generate
synthesised voice data by means of the concatenation of audio
segments.

32. The device according to one of Claims 16 to 31,
characterised in that

- means (117) are provided for the conversion of the
synthesised acoustical data to acoustical signals and/or voice
signals.

33. A data carrier which includes a computer program for the
co-articulation-specific concatenation of audio segments in
order to generate synthesised acoustical data which reproduces
a sequence of concatenated phones, comprising the following
steps:

- selection of at least two audio segments which contain 
bands, each of which reproducing a portion of a sound/phone or 
a portion of a sound/phone sequence, 

characterised by the steps of: 

- establishing a band to be used of an earlier audio segment; 

- establishing a band to be used of a later audio segment,
which begins with the later audio segment and ends with the
co-articulation band of the later audio segment which follows
the initially used solo articulation band;

- with the duration and position of the bands to be used being 
determined as a function of the earlier and later audio seg- 
ments; and 

- concatenating the established band of the earlier audio seg- 
ment with the established band of the later audio segment, in 
that the instance of concatenation, as a function of proper- 
ties of the used band of the later audio segment, is set in 
its established band which starts immediately before the band 
to be used of the later audio segment and ends with same. 

34. The data carrier according to Claim 33, characterised in 
that the computer program selects the instance of the conca- 
tenation of the used band of the second audio segment with the 
used band of the first audio segment in such a manner that 

- the instance of concatenation is set in a band which lies in 
the vicinity of the boundaries of the initially used solo 
articulation band of the later audio segment, if its used band 
reproduces a static phone at the start; 

- a downstream portion of the used band of the earlier audio
segment and an upstream portion of the used band of the later
audio segment are processed by suitable transfer functions and
added in an overlapping manner (cross fade), with the transfer
functions and the length of an overlapping portion of the two
bands being determined depending on the audio segments to be
concatenated.
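
Illustrative note (not part of the claims): a cross fade of this kind can be sketched as an overlap-add of the end of the earlier band and the start of the later band. The sketch assumes linear ramps as the transfer functions and a caller-supplied overlap length. The claim leaves both open, and the name `cross_fade` is hypothetical.

```python
from typing import List, Sequence


def cross_fade(earlier: Sequence[float], later: Sequence[float],
               overlap: int) -> List[float]:
    """Join two bands by adding their weighted samples over `overlap`
    samples. Linear weights are an assumption made for this sketch."""
    if overlap <= 0 or overlap > min(len(earlier), len(later)):
        raise ValueError("overlap must fit inside both bands")
    head = list(earlier[:-overlap])   # untouched part of the earlier band
    tail = list(later[overlap:])      # untouched part of the later band
    mixed = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)   # fade-in weight of the later band
        e = earlier[len(earlier) - overlap + k]
        mixed.append((1.0 - w) * e + w * later[k])
    return head + mixed + tail
```

The result is `overlap` samples shorter than the two bands laid end to end, because the overlapping portions are added rather than appended.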

35. The data carrier according to Claim 33 or 34, charac- 
terised in that the computer program selects the instance of 
the concatenation of the used band of the second audio segment 
with the used band of the first audio segment in such a manner 
that 



- the instance of concatenation is set in a band which lies 
immediately before the used band of the later audio segment, 
if its used band reproduces a dynamic phone at the start; 

- a downstream portion of the used band of the earlier audio
segment and an upstream portion of the used band of the later
audio segment are processed by suitable transfer functions and
added in a non-overlapping manner (hard fade), with the
transfer functions being determined depending on the audio
segments to be concatenated.
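
Illustrative note (not part of the claims): a hard fade can be sketched as a short fade-out of the earlier band followed, without overlap, by a short fade-in of the later band. Linear ramps and the `ramp` length are assumptions made for this sketch, and `hard_fade` is a hypothetical name.

```python
from typing import List, Sequence


def hard_fade(earlier: Sequence[float], later: Sequence[float],
              ramp: int) -> List[float]:
    """Join two bands end to end. The last `ramp` samples of the
    earlier band are faded out and the first `ramp` samples of the
    later band are faded in; no samples are added together."""
    if ramp < 0:
        raise ValueError("ramp must not be negative")
    out = list(earlier)
    inn = list(later)
    n_out = min(ramp, len(out))
    n_in = min(ramp, len(inn))
    for k in range(n_out):
        # Weights fall from n_out/(n_out+1) down to 1/(n_out+1).
        out[len(out) - n_out + k] *= (n_out - k) / (n_out + 1)
    for k in range(n_in):
        # Weights rise from 1/(n_in+1) up to n_in/(n_in+1).
        inn[k] *= (k + 1) / (n_in + 1)
    return out + inn
```

Unlike the cross fade, the output length equals the sum of the two band lengths.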

36. The data carrier according to one of Claims 33 to 35, 
characterised in that the computer program selects a band of 
an audio segment for a phone or a portion of the sequence of 
concatenated phones at the start of the concatenated phone 
sequence, the start of which reproduces the properties of the 
start of the concatenated sequence of phones. 

37. The data carrier according to one of Claims 33 to 36,
characterised in that the computer program selects a band of
an audio segment for a phone or a portion of the sequence of
concatenated phones at the end of the concatenated phone
sequence, the end of which reproduces the properties of the
end of the concatenated sequence of phones.

38. The data carrier according to one of Claims 33 to 37,
characterised in that the computer program carries out a
processing of the used bands of individual audio segments with
the aid of suitable functions depending on properties of the
phone sequence, with the functions involving i.a. modification
of the frequency, the duration, the amplitude, or the
spectrum.

39. The data carrier according to one of Claims 33 to 38,
characterised in that the computer program selects an audio
segment band for the later audio segment band which reproduces
the highest number of successive portions of the concatenated
phones in the phone sequence, in order to use the smallest
number of audio segment bands in the generation of the
synthesised acoustical data.
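
Illustrative note (not part of the claims): choosing, at each step, the band that reproduces the most successive phones can be sketched as a greedy longest-match search. The sketch assumes phones are given as labels and the available bands as tuples of labels. It ignores the overlap between neighbouring segments shown in the drawings, and greedy choice does not guarantee the smallest possible number of bands in every case. The name `select_segments` is hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def select_segments(target: Sequence[str],
                    inventory: Set[Tuple[str, ...]]) -> List[Tuple[str, ...]]:
    """Cover `target` from left to right, always taking the longest
    available band that matches at the current position."""
    chosen: List[Tuple[str, ...]] = []
    max_len = max((len(s) for s in inventory), default=0)
    pos = 0
    while pos < len(target):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(target) - pos), 0, -1):
            cand = tuple(target[pos:pos + n])
            if cand in inventory:
                chosen.append(cand)
                pos += n
                break
        else:
            raise KeyError(f"no band available for {target[pos]!r}")
    return chosen
```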

40. The data carrier according to one of Claims 39 to 45,
characterised in that the computer program carries out a
processing of the used bands of individual audio segments with
the aid of suitable functions in a band in which the instance
of concatenation lies, with these functions involving i.a. a
modification of the frequency, the duration, the amplitude, or
the spectrum.

41. The data carrier according to one of Claims 33 to 40, 
characterised in that the computer program establishes the in- 
stance of concatenation in a place of the used bands of the 
first and/or the second audio segment, in which the two used 
bands are in agreement with respect to one or several suitable 
properties, with these properties including i.a.: zero points, 
amplitude values, gradients, derivatives of any degree, 
spectra, tone levels, amplitude values in a frequency band, 
volume, style of speech, emotion of speech, or other pro- 
perties covered in the phone classification scheme. 

42. The data carrier according to one of Claims 33 to 41,
characterised in that the computer program carries out an
implementation of heuristic knowledge which relates to the
selection of the used bands of the individual audio segments,
their processing, their variation, as well as their
concatenation.

43. The data carrier according to one of Claims 33 to 42,
characterised in that the computer program is suited for the
generation of synthesised voice data, with the sounds being
phones.



44. The data carrier according to one of Claims 34 to 42,
characterised in that the computer program is suited for the
generation of static phones, with the static phones comprising
vowels, diphthongs, liquids, vibrants, fricatives, and nasals.

45. The data carrier according to one of Claims 35 to 44,
characterised in that the computer program is suited for the
generation of dynamic phones, with the dynamic phones
comprising plosives, affricates, glottal stops, and click sounds.
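
Illustrative note (not part of the claims): the static and dynamic phone classes listed in Claims 44 and 45, together with the rule of Claims 34 and 35 (cross fade before a static start, hard fade before a dynamic start), can be sketched as a simple lookup. The class labels and the name `fade_mode` are hypothetical and chosen only for this sketch.

```python
# Phone classes as listed in Claims 44 and 45 (labels are assumptions).
STATIC_CLASSES = {"vowel", "diphthong", "liquid", "vibrant",
                  "fricative", "nasal"}
DYNAMIC_CLASSES = {"plosive", "affricate", "glottal stop", "click"}


def fade_mode(initial_phone_class: str) -> str:
    """Return "cross" if the later band starts with a static phone,
    "hard" if it starts with a dynamic phone."""
    if initial_phone_class in STATIC_CLASSES:
        return "cross"
    if initial_phone_class in DYNAMIC_CLASSES:
        return "hard"
    raise ValueError(f"unknown phone class: {initial_phone_class!r}")
```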

46. The data carrier according to one of Claims 33 to 45,
characterised in that the computer program converts the
synthesised acoustical data to acoustical convertible data
and/or voice signals.

47. Synthesised voice signals which consist of a sequence of
sounds or phones, respectively, with the voice signals being
generated in that:

- at least two audio segments are selected which reproduce the 
sounds or phones, respectively; and 

- the audio segments are linked by a co-articulation-specific 
concatenation, with 

- one band to be used of an earlier audio segment being estab- 
lished; 

- one band to be used of a later audio segment being estab- 
lished which starts with the later audio segment and ends with 
the co-articulation band of the later audio segment, following 
the initially used solo articulation band; 

- with the duration and position of the bands to be used being 
determined depending on the audio segments; and 

- the used bands of the audio segments being concatenated in a
co-articulation-specific manner, in that the instance of
concatenation, as a function of properties of the used band of
the later audio segment, is set in a band which starts
immediately before the used band of the later audio segment and
ends with same.

48. The synthesised voice signals according to Claim 47, 
characterised in that the voice signals are generated in that 

- the audio segments are concatenated in an instance which
lies in the vicinity of the boundaries of the later audio
segment, if the start of this band reproduces a static sound
or phone, respectively, with the static phone being a vowel, a
diphthong, a liquid, a fricative, a vibrant, or a nasal; and

- a downstream portion of the used band of the earlier audio
segment and an upstream portion of the used band of the later
audio segment are processed by means of suitable transfer
functions and both bands are added in an overlapping manner
(cross fade), with the transfer functions and the length of an
overlapping portion of the two bands being determined
depending on the audio segments to be concatenated.

49. The synthesised voice signals according to Claim 47 or 
48, characterised in that the voice signals are generated in 
that 

- the audio segments are concatenated in an instance which
lies immediately before the used band of the later audio
segment, if the start of this band reproduces a dynamic sound or
phone, respectively, with the dynamic phone being a plosive,
an affricate, a glottal stop, or a click sound; and

- a downstream portion of the used band of the earlier audio
segment and an upstream portion of the used band of the later
audio segment are processed by means of suitable transfer
functions and both bands are joined in a non-overlapping
manner (hard fade), with the transfer functions being
determined depending on the audio segments to be concatenated.

50. The synthesised voice signals according to one of Claims
47 to 49, characterised in that

- the first sound or the first phone, respectively, or a
portion of the first phone sequence or of the first polyphone,
respectively, in the sequence is generated by an audio
segment, whose used band at the start reproduces the properties
of the start of the sequence.

51. The synthesised voice signals according to one of Claims
47 to 50, characterised in that

- the last sound or the last phone, respectively, or a portion
of the last phone sequence or of the last polyphone,
respectively, in the sequence is generated by an audio segment,
whose used band at the end reproduces the properties of the end
of the sequence.

52. The synthesised voice signals according to one of Claims
47 to 51, characterised in that

- the voice signals are generated in that later bands of audio
segments, beginning with the reproduction of a dynamic sound
or phone, respectively, are concatenated with earlier bands of
audio segments, beginning with the reproduction of a static
sound or phone, respectively.

53. The synthesised voice signals according to one of Claims
47 to 52, characterised in that

- such audio segments are selected which reproduce the highest
number of portions of sounds or phones, respectively, of the
sequence, in order to use the smallest number of audio segment
bands in the generation of the voice signals.

54. The synthesised voice signals according to one of Claims
47 to 53, characterised in that

- the voice signals are generated by the concatenation of the
used bands of audio segments which are processed with the aid
of suitable functions depending on properties of the sound
sequence or phone sequence, respectively, with the functions
involving i.a. a modification of the frequency, the duration,
the amplitude, or the spectrum.

55. The synthesised voice signals according to one of Claims
47 to 54, characterised in that

- the voice signals are generated by the concatenation of the
used bands of audio segments which are processed with the aid
of suitable functions depending on properties of the sound
sequence or phone sequence, respectively, in an area in which
the instance of concatenation lies, with these properties
including i.a. a modification of the frequency, the duration,
the amplitude, or the spectrum.

56. The synthesised voice signals according to one of Claims
47 to 55, characterised in that the instance of concatenation
lies at a place in the used bands of the earlier and/or the
later audio segment, in which the two used bands are in
agreement with respect to one or several suitable properties,
with these properties including i.a.: zero points, amplitude
values, gradients, derivatives of any degree, spectra, tone
levels, amplitude values in a frequency band, volume, style of
speech, emotion of speech, or other properties covered in the
phone classification scheme.

57. The synthesised voice signals according to one of Claims
47 to 56, characterised in that the voice signals are suited
for a conversion to acoustic signals.

58. An acoustical, optical, magnetic, or electrical data
storage which contains audio segments in order to generate
synthesised acoustical data by means of a concatenation of used
bands of the audio segments, utilising the methods according
to Claim 1, or the device according to Claim 16, or the data
carrier according to Claim 33.

59. The data storage according to Claim 58, characterised in
that a group of the audio segments reproduces sounds or
phones, respectively, or portions of sounds or phones,
respectively.

60. The data storage according to Claim 58 or 59, character- 
ised in that a group of the audio segments reproduces phone 
sequences or portions of phone sequences or polyphones, res- 
pectively, or portions of polyphones. 

61. The data storage according to one of Claims 58 to 60,
characterised in that a group of audio segments is provided
whose used bands start with a static sound or phone,
respectively, with the static phones comprising vowels,
diphthongs, liquids, fricatives, vibrants, and nasals.

62. The data storage according to one of Claims 58 to 61,
characterised in that audio segments are provided which are
suitable for the conversion to acoustical signals.

63. The data storage according to one of Claims 58 to 62, 
which additionally contains information in order to carry out 
a processing of the used bands of individual audio segments 
with the aid of suitable functions depending on properties of 
the acoustical data to be synthesised, with the functions in- 
volving i.a. a modification of the frequency, the duration, 
the amplitude, or the spectrum. 

64. The data storage according to one of Claims 58 to 63,
which additionally contains information relating to a
processing of the used bands of individual audio segments with
the aid of suitable functions in a band in which the instance of
concatenation lies, with this function involving i.a. a
modification of the frequency, the duration, the amplitude, or
the spectrum.

65. The data storage according to one of Claims 58 to 64,
which additionally provides linked audio segments, whose
instance of concatenation lies at a place of the used bands of
the earlier and/or later audio segment, where both used bands
are in agreement with respect to one or several suitable
properties, with these properties being i.a.: zero points,
amplitude values, gradients, derivatives of any degree, spectra,
tone levels, amplitude values in a frequency band, volume,
style of speech, emotion of speech, or other properties
covered in the phone classification scheme.

66. The data storage according to one of Claims 51 to 58,
which additionally contains information in the form of
heuristic knowledge, which relates to the selection of the
used bands of the individual audio segments, their processing,
their variation, as well as their concatenation.

67. Sound carrier which contains data which at least partially
is synthesised acoustical data which were generated

- by means of the method according to Claim 1, or

- by means of the device according to Claim 16, or

- by utilising the data carrier according to Claim 58, or

- by utilising a data storage according to Claim 58, or

- which are the voice signals according to Claim 47.



68. The sound carrier according to Claim 67, characterised in
that the synthesised acoustical data is synthesised voice
data.



Abstract 

The invention enables the synthesising of any acoustical data
by a concatenation of individual audio segment bands, with the
instances in which the respective concatenation of two
successive audio segment bands takes place being established as
a function of properties of the audio segments. In this manner,
synthesised acoustical data can be generated which, after a
conversion to acoustical signals, do not differ from
corresponding, naturally generated acoustical signals. In
particular, the invention permits the generation of synthesised
voice data under consideration of co-articulatory effects by
means of a concatenation of individual voice audio segments. The
voice data provided in this manner can be converted to voice
signals which cannot be distinguished from a naturally spoken
language.

Figure 1b: Structure of a sound / phone

Figs. 2a to 2l: Structures of the audio segments

Figs. 3a to 3d: Concatenation

[Drawings not reproduced. Figs. 3a to 3eII each show Audio segment 1
and Audio segment 2, divided into bands labelled SAB, EKB, and AKB,
above the phones (Phone 1 to Phone 3) that the bands reproduce.]

10.04.2001 10:46 WUESTHOFF & WUESTHOFF +49 89 62180015




Docket No 
87977.02SH01 



Declaration and Power of Attorney For Patent Application

English Language Declaration 

As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below next to my name.

I believe I am the original, first and sole inventor (if only one name is listed below) or an original,
first and joint inventor (if plural names are listed below) of the subject matter which is claimed and for
which a patent is sought on the invention entitled

METHOD AND DEVICE FOR THE CONCATENATION OF AUDIOSEGMENTS, TAKING INTO 
ACCOUNT CO ARTICULATION 

the specification of which 

(check one)

☒ is attached hereto

□ was filed on ________ as United States Application No. or PCT International
Application Number ________

and was amended on ________
(if applicable)



I hereby state that I have reviewed and understand the contents of the above identified specification, 
including the claims, as amended by any amendment referred to above. 

I acknowledge the duty to disclose to the United States Patent and Trademark Office all information
known to me to be material to patentability as defined in Title 37, Code of Federal Regulations,
Section 1.56.

I hereby claim foreign priority benefits under Title 35, United States Code, Section 119(a)-(d) or
Section 365(b) of any foreign application(s) for patent or inventor's certificate, or Section 365(a) of
any PCT International application which designated at least one country other than the United States,
listed below and have also identified below, by checking the box, any foreign application for patent or
inventor's certificate or PCT International application having a filing date before that of the application
on which priority is claimed.

Prior Foreign Application(s)                              Priority Not Claimed

198 37 661.8    DE           August 19, 1998              □
(Number)        (Country)    (Day/Month/Year Filed)

                                                          □
(Number)        (Country)    (Day/Month/Year Filed)

                                                          □
(Number)        (Country)    (Day/Month/Year Filed)



Form PTO-SB-01 (Modified) 



P02/REV02 



Patent and Trademark Office-U.S. DEPARTMENT OF COMMERCE






I hereby claim the benefit under 35 U.S.C. Section 119(e) of any United States provisional
application(s) listed below:

None

(Application Serial No.)    (Filing Date)

(Application Serial No.)    (Filing Date)

(Application Serial No.)    (Filing Date)



I hereby claim the benefit under 35 U.S.C. Section 120 of any United States application(s), or
Section 365(c) of any PCT International application designating the United States, listed below and,
insofar as the subject matter of each of the claims of this application is not disclosed in the prior
United States or PCT International application in the manner provided by the first paragraph of 35
U.S.C. Section 112, I acknowledge the duty to disclose to the United States Patent and Trademark
Office all information known to me to be material to patentability as defined in Title 37, C.F.R.,
Section 1.56 which became available between the filing date of the prior application and the national
or PCT International filing date of this application:



PCT/EP99/06081              August 19, 1999    Pending
(Application Serial No.)    (Filing Date)      (Status)
                                               (patented, pending, abandoned)

(Application Serial No.)    (Filing Date)      (Status)
                                               (patented, pending, abandoned)

(Application Serial No.)    (Filing Date)      (Status)
                                               (patented, pending, abandoned)



I hereby declare that all statements made herein of my own knowledge are true and that all
statements made on information and belief are believed to be true; and further that these statements
were made with the knowledge that willful false statements and the like so made are punishable by
fine or imprisonment or both, under Section 1001 of Title 18 of the United States Code and that such
willful false statements may jeopardize the validity of the application or any patent issued thereon.



Form PTO-SB-01 (6-95) (Modified)

Patent and Trademark Office-U.S. DEPARTMENT OF COMMERCE






POWER OF ATTORNEY: As a named inventor, I hereby appoint the following attorney(s) and/or
agent(s) to prosecute this application and transact all business in the Patent and Trademark Office
connected therewith. (list name and registration number)

Thomas R. FitzGerald    Reg. No. 26,730
Ronald S. Kareken       Reg. No. [illegible]
Lee J. Fleckenstein     Reg. No. [illegible]
Laurence S. Roach       Reg. No. [illegible]
Stephen J. Sand         Reg. No. [illegible]
Ronald J. Kisicki       Reg. No. [illegible]



Send Correspondence to: Thomas R. FitzGerald, Esq.
Jaeckle Fleischmann & Mugel, LLP
39 State Street
Rochester, New York 14614-1310

Direct Telephone Calls to: (name and telephone number)

Thomas R. FitzGerald, Esq. Tel (716) 262-3640 - Fax (716) 262-4133



Full name of sole or first inventor
Christoph Buskies

Sole or first inventor's signature

Date

Residence
[street address illegible] Hamburg, Germany



Citizenship 
German 



Post Office Address
Same as Above 



Full name of second inventor, if any



Second inventor's signature 



Date 



Residence 



Citizenship 



Post Office Address 






